You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/06 00:33:22 UTC

[GitHub] [arrow] kylebarron opened a new pull request, #13076: ARROW-8674: [JS] Implement IPC RecordBatch body buffer compression

kylebarron opened a new pull request, #13076:
URL: https://github.com/apache/arrow/pull/13076

   Haven't quite gotten this to work yet, attempting with a sample IPC file from PyArrow, but putting up for visibility.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kylebarron commented on a diff in pull request #13076: ARROW-8674: [JS] Implement IPC RecordBatch body buffer compression

Posted by GitBox <gi...@apache.org>.
kylebarron commented on code in PR #13076:
URL: https://github.com/apache/arrow/pull/13076#discussion_r1037457502


##########
js/src/ipc/reader.ts:
##########
@@ -352,12 +353,24 @@ abstract class RecordBatchReaderImpl<T extends TypeMap = any> implements RecordB
         return this;
     }
 
-    protected _loadRecordBatch(header: metadata.RecordBatch, body: any) {
+    protected _loadRecordBatch(header: metadata.RecordBatch, body: Uint8Array) {
+        if (header.compression) {
+            const codec = compressionRegistry.get(header.compression);
+            if (codec?.decode) {
+                // TODO: does this need to be offset by 8 bytes? Since the uncompressed length is
+                // written in the first 8 bytes:
+                // https://github.com/apache/arrow/blob/1fc251f18d5b48f0c9fe8af8168237e7e6d05a45/format/Message.fbs#L59-L65

Review Comment:
   Thanks for doing some digging! This PR isn't currently at the top of my list of things to work on, but would love for this support to be added to arrow JS!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kylebarron commented on pull request #13076: ARROW-8674: [JS] Implement IPC RecordBatch body buffer compression

Posted by "kylebarron (via GitHub)" <gi...@apache.org>.
kylebarron commented on PR #13076:
URL: https://github.com/apache/arrow/pull/13076#issuecomment-1708371049

   > Do you plan on continuing this in the next weeks?
   
   No I have no imminent plans to work on this, but I'd love to see support for this added.
   
   > what's left to do in your opinion
   
   - I never got this actually working at runtime; you probably need to consult the IPC spec closer to see if I'm decompressing the wrong byte range, as suggested in the comment above.
   
       I don't have any specific suggestions on this because I haven't looked at it lately.
   
   - The codec API probably needs to have some consensus on what that should look like.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kitsunde commented on a diff in pull request #13076: ARROW-8674: [JS] Implement IPC RecordBatch body buffer compression

Posted by GitBox <gi...@apache.org>.
kitsunde commented on code in PR #13076:
URL: https://github.com/apache/arrow/pull/13076#discussion_r1028207049


##########
js/src/ipc/reader.ts:
##########
@@ -352,12 +353,24 @@ abstract class RecordBatchReaderImpl<T extends TypeMap = any> implements RecordB
         return this;
     }
 
-    protected _loadRecordBatch(header: metadata.RecordBatch, body: any) {
+    protected _loadRecordBatch(header: metadata.RecordBatch, body: Uint8Array) {
+        if (header.compression) {
+            const codec = compressionRegistry.get(header.compression);
+            if (codec?.decode) {
+                // TODO: does this need to be offset by 8 bytes? Since the uncompressed length is
+                // written in the first 8 bytes:
+                // https://github.com/apache/arrow/blob/1fc251f18d5b48f0c9fe8af8168237e7e6d05a45/format/Message.fbs#L59-L65

Review Comment:
   I started digging into this although being unfamiliar to arrow and zstd both it has been a bit of a struggle.
   
   The first 8 bytes are not part of the compressed section. You can see it corrected for in other places like here: https://github.com/apache/arrow/pull/9137/files#diff-38c3da9d6c7d521df83f59ee7d8cfced00951e8dcd38f39c7c36d26b6feee3d6R75
   
   If you encode something with ZSTD you'll see it starts with `[40, 181, 47, 253, 32]` which comes after those 8 bytes.  I'e just encoding:
   
   ```
   const testVal = compress(Buffer.from(new Uint8Array([1, 0, 0, 0, 0, 0, 0, 0])));
   
   Uint8Array(17) [
     40, 181, 47, 253, 32, 8, 65,
      0,  0,  1,   0,  0, 0,  0,
      0,  0,  0
   ]
   ```
   
   This within the body here is:
   
   ```
   const compressedMessage = new Uint8Array([
     8, 0, 0, 0,
     0, 0, 0, 0,
     40, 181, 47,
     253, 32, 8, 65,
     0, 0, 1, 0,
     0, 0, 0, 0,
     0, 0, 0, 0,
     0, 0, 0, 0,
     0
   ]);
   ```
   
   Now there's another issue where which is there appears to be padding at the end and between the fields which I'm very uncertain on how to detect besides brute forcing in reverse between the LZ4 header and the end.
   
   ```
   console.log(decompress(compressedMessage.slice(8, 17 + 8)));
   
   Uint8Array([
     40, 181, 47, 253, 
     32, 8, 65, 0,
     0, 1, 0, 0,
     0, 0, 0, 0,
     0
   ]);
   ```
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] isaacbrodsky commented on a diff in pull request #13076: ARROW-8674: [JS] Implement IPC RecordBatch body buffer compression

Posted by GitBox <gi...@apache.org>.
isaacbrodsky commented on code in PR #13076:
URL: https://github.com/apache/arrow/pull/13076#discussion_r868559771


##########
js/package.json:
##########
@@ -99,6 +99,7 @@
     "jest": "27.5.1",
     "jest-silent-reporter": "0.5.0",
     "lerna": "4.0.0",
+    "lz4js": "0.2.0",

Review Comment:
   This is intended to only be registered in tests, and users need to declare the dependency and register in their own applications if they want?
   
   We should include code snippets and instructions for how to do that for each codec needed to support the full Arrow spec. The error message could even link to the docs on exactly how to enable a given compression codec.



##########
js/src/ipc/reader.ts:
##########
@@ -352,12 +353,24 @@ abstract class RecordBatchReaderImpl<T extends TypeMap = any> implements RecordB
         return this;
     }
 
-    protected _loadRecordBatch(header: metadata.RecordBatch, body: any) {
+    protected _loadRecordBatch(header: metadata.RecordBatch, body: Uint8Array) {
+        if (header.compression) {
+            const codec = compressionRegistry.get(header.compression);
+            if (codec?.decode) {
+                // TODO: does this need to be offset by 8 bytes? Since the uncompressed length is
+                // written in the first 8 bytes:
+                // https://github.com/apache/arrow/blob/1fc251f18d5b48f0c9fe8af8168237e7e6d05a45/format/Message.fbs#L59-L65

Review Comment:
   Confusing that here the uint8array is the "body" but it might correspond to the "buffer" in that spec. I'm not sure off hand which is actually here?



##########
js/src/ipc/reader.ts:
##########
@@ -352,12 +353,24 @@ abstract class RecordBatchReaderImpl<T extends TypeMap = any> implements RecordB
         return this;
     }
 
-    protected _loadRecordBatch(header: metadata.RecordBatch, body: any) {
+    protected _loadRecordBatch(header: metadata.RecordBatch, body: Uint8Array) {
+        if (header.compression) {
+            const codec = compressionRegistry.get(header.compression);
+            if (codec?.decode) {
+                // TODO: does this need to be offset by 8 bytes? Since the uncompressed length is
+                // written in the first 8 bytes:
+                // https://github.com/apache/arrow/blob/1fc251f18d5b48f0c9fe8af8168237e7e6d05a45/format/Message.fbs#L59-L65
+                body = codec.decode(body);
+            } else {
+                throw new Error('Record batch is compressed but codec not found');

Review Comment:
   Should this error include the codec ID?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13076: ARROW-8674: [JS] Implement IPC RecordBatch body buffer compression

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13076:
URL: https://github.com/apache/arrow/pull/13076#issuecomment-1119163084

   :warning: Ticket **has not been started in JIRA**, please click 'Start Progress'.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] metalmatze commented on pull request #13076: ARROW-8674: [JS] Implement IPC RecordBatch body buffer compression

Posted by "metalmatze (via GitHub)" <gi...@apache.org>.
metalmatze commented on PR #13076:
URL: https://github.com/apache/arrow/pull/13076#issuecomment-1708121489

   Hey @kylebarron, super nice work on this PR so far! 
   We are very interested in getting this support into Arrow JS. 
   Do you plan on continuing this in the next weeks? 
   
   If you're busy with other things, can you briefly elaborate on what's left to do in your opinion such that we may consider moving it forward from our side.
   
   Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13076: ARROW-8674: [JS] Implement IPC RecordBatch body buffer compression

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13076:
URL: https://github.com/apache/arrow/pull/13076#issuecomment-1119163077

   https://issues.apache.org/jira/browse/ARROW-8674


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org