Posted to github@arrow.apache.org by "emcake (via GitHub)" <gi...@apache.org> on 2023/06/13 16:20:13 UTC

[GitHub] [arrow-rs] emcake opened a new issue, #4409: Sliced record batches containing lists produce infeasibly large IPC file

emcake opened a new issue, #4409:
URL: https://github.com/apache/arrow-rs/issues/4409

   **Describe the bug**
   When slicing a record batch down to a subset, the batch reports the correct number of rows. However, when it is serialized via the File IPC writer, the size in bytes of the 'file' is very large relative to the amount of content. I wouldn't expect the size to be linear in the number of rows (given fixed overhead and potentially compression), but the output is still very large, even for a single record.
   
   **To Reproduce**
   This test serializes a full batch and a 1-row slice of it, and prints the size of each:
   
   ```rust
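       use std::sync::Arc;

       use arrow::array::{ListBuilder, UInt32Builder};
       use arrow::datatypes::{DataType, Field, Schema};
       use arrow::ipc::writer::FileWriter;
       use arrow::record_batch::RecordBatch;

       #[test]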
       fn encode_list_length() {
           let val_inner = Field::new("item", DataType::UInt32, true);
           let val_list_field = Field::new("val", DataType::List(Arc::new(val_inner)), false);
   
           let schema = Arc::new(Schema::new(vec![val_list_field]));
   
           let values = {
               let u32 = UInt32Builder::new();
               let mut ls = ListBuilder::new(u32);
   
               for i in 0..100000 {
                   for value in vec![i, i, i] {
                       ls.values().append_value(value);
                   }
                   ls.append(true)
               }
   
               ls.finish()
           };
   
           let batch = RecordBatch::try_new(Arc::clone(&schema), vec![Arc::new(values)]).unwrap();
   
           fn serialize_batch(rb: &RecordBatch) -> Vec<u8> {
               let mut writer = FileWriter::try_new(Vec::<u8>::new(), &rb.schema()).unwrap();
               writer.write(&rb).unwrap();
               writer.finish().unwrap();
               let data = writer.into_inner().unwrap();
   
               data
           }
   
           let full_batch = serialize_batch(&batch);
   
           println!(
               "full batch = {} rows, {} bytes",
               batch.num_rows(),
               full_batch.len()
           );
   
           let sliced = batch.slice(999, 1); // slice out 1 row
   
           assert_eq!(sliced.num_rows(), 1); // confirm only 1 row
   
           let sliced_batch = serialize_batch(&sliced);
   
           println!(
               "sliced batch = {} rows, {} bytes",
               sliced.num_rows(),
               sliced_batch.len()
           );
   
           assert!(sliced_batch.len() < (full_batch.len() / 10)); // serializing 1 row should be significantly smaller than serializing 100000
       }
   ```
   
   Produces:
   ```
   full batch = 100000 rows, 1650646 bytes
   sliced batch = 1 rows, 1238150 bytes
   ```
   and the final assertion fails, since the 1-row slice serializes to about 75% of the size of the full batch.
   
   **Expected behavior**
   Serializing a batch of 1 row should produce far fewer bytes than serializing 100k rows.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #4409: IPC Writer Truncate Sliced List Values

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4409:
URL: https://github.com/apache/arrow-rs/issues/4409#issuecomment-1589686534

   Currently the IPC writer faithfully writes the in-memory encoding of the list. Unfortunately, the nature of the list encoding is such that slicing doesn't propagate to the values themselves. #2040 added logic to handle this for variable-size byte array types; it should be a relatively straightforward PR to do something similar for ListArray, LargeListArray and MapArray.
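
   To illustrate the point about slicing not propagating to the values, here is a minimal std-only sketch of the idea. It is a toy model, not the arrow-rs implementation: `ToyList`, `slice`, `truncate_values` and `build` are hypothetical names, and the slice copies the values buffer only for simplicity (a real slice would share it). In the reproduction above, 300,000 `u32` values at 4 bytes each come to 1,200,000 bytes, which would be consistent with the roughly 1.2 MB reported for the 1-row slice.

   ```rust
   /// Toy model of a list array: row `i` owns `values[offsets[i]..offsets[i + 1]]`.
   /// This is NOT the arrow-rs `ListArray` type, only an illustration of the layout.
   #[derive(Debug, Clone, PartialEq)]
   struct ToyList {
       offsets: Vec<i32>, // always holds num_rows + 1 entries
       values: Vec<u32>,
   }

   impl ToyList {
       fn num_rows(&self) -> usize {
           self.offsets.len() - 1
       }

       /// Slicing narrows only the offsets window; the child values are kept whole.
       fn slice(&self, start: usize, len: usize) -> ToyList {
           ToyList {
               offsets: self.offsets[start..=start + len].to_vec(),
               values: self.values.clone(),
           }
       }

       /// Hypothetical writer-side fix: keep only the values the offsets still
       /// reference, and rebase the offsets so they start at zero.
       fn truncate_values(&self) -> ToyList {
           let first = self.offsets[0];
           let last = self.offsets[self.offsets.len() - 1];
           ToyList {
               offsets: self.offsets.iter().map(|o| o - first).collect(),
               values: self.values[first as usize..last as usize].to_vec(),
           }
       }
   }

   /// Build `rows` rows of `[i, i, i]`, the same shape as the reproduction.
   fn build(rows: u32) -> ToyList {
       let mut offsets = vec![0i32];
       let mut values = Vec::new();
       for i in 0..rows {
           values.extend([i, i, i]);
           offsets.push(values.len() as i32);
       }
       ToyList { offsets, values }
   }
   ```

   In this model, `build(1000).slice(999, 1)` reports one row but still carries all 3,000 values; calling `truncate_values()` on it leaves offsets `[0, 3]` and the three values for that row.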




Re: [I] IPC Writer Truncate Sliced List Values [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #4409: IPC Writer Truncate Sliced List Values
URL: https://github.com/apache/arrow-rs/issues/4409

