You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/19 16:53:27 UTC

[GitHub] [arrow-rs] avantgardnerio opened a new issue, #3142: AppendableRecordBatch

avantgardnerio opened a new issue, #3142:
URL: https://github.com/apache/arrow-rs/issues/3142

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   
   At Space and Time, we would like to make an OLTP columnar DB based on Arrow (see noisepage for precedent). Unfortunately, Arrow RecordBatches are immutable by design. 
   
   **Describe the solution you'd like**
   
   Fortunately however, append-only and immutable are not necessarily at odds, as long as one takes immutable RecordBatch snapshots at a point-in-time of the append-only version. See [this gist](https://gist.github.com/avantgardnerio/60a53a3481f3d7844efa12346ef6f814) for a working PoC implementation:
   
   ![Screenshot from 2022-11-19 09-46-31](https://user-images.githubusercontent.com/3855243/202862296-8b90220d-0828-411b-b83c-4215d8633b10.png)
   
   **Describe alternatives you've considered**
   
   1. using arrow2
   2. trying some trickery with making the `MutableBuffer` work, vs preallocating ones from the immutable package
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] avantgardnerio commented on issue #3142: AppendableRecordBatch

Posted by GitBox <gi...@apache.org>.

avantgardnerio commented on issue #3142:
URL: https://github.com/apache/arrow-rs/issues/3142#issuecomment-1328161372

   Others might still find it useful, but I've convinced myself with how we are planning to handle MVCC we'll need to make copies anyway, so the builder approach is probably fine (given a non-consuming `.finish()`).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] alamb commented on issue #3142: AppendableRecordBatch

Posted by GitBox <gi...@apache.org>.

alamb commented on issue #3142:
URL: https://github.com/apache/arrow-rs/issues/3142#issuecomment-1322692312

   > Fortunately however, append-only and immutable are not necessarily at odds, as long as one takes immutable RecordBatch snapshots at a point-in-time of the append-only version. 
   
   I think the general pattern of 
   * Append data ....
   * Snapshot (by copying into a read only `RecordBatch`)
   * keep appending data
   
   Makes sense with arrow. In fact this is what we do in IOx on the write path. 
   
   This pattern can be done with the various `*Builder`s -- such as  https://docs.rs/arrow/27.0.0/arrow/array/type.Int64Builder.html
   
   So something like:
   
   ```rust
   builder.append_value(1);
   
   // get an array to use in datafusion, etc
   let array = builder.build();
   
   // make a new builder and start building the second batch
   builder = Int64Builder::new();
   builder.append_value(2);
   builder.append_value(3);
   ```
   
   
   
   
   I think what @tustvold  is suggesting with " non-consuming finish method to the builders" means you could do something like:
   
   ```rust
   builder.append_value(1);
   
   // get an array to use in datafusion, etc (by copying the underlying values)
   let array = builder.snapshot();
   
   // existing builder can be used to append 
   builder.append_value(2);
   builder.append_value(3);
   ```
   
   
   I don't fully follow the proposal in https://gist.github.com/avantgardnerio/48d977ea6bd28c790cfb6df09250336d 
   
   As soon as you have this function:
   
   ```rust
       pub fn as_slice(&self) -> RecordBatch {
           self.record_batch.slice(0, self.len())
       }
   ```
   
   This means that now multiple locations may be able to read that data so trying to update as well will likely be an exercise in frustration as you try and fight the rust compiler to teach it that somehow you have guaranteed that the memory remains valid and that concurrent read and modification are ok.
   
   Maybe we could start with the "make a snapshot based copy" and if the copying is too much figure out some different approach.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] tustvold commented on issue #3142: AppendableRecordBatch

Posted by GitBox <gi...@apache.org>.

tustvold commented on issue #3142:
URL: https://github.com/apache/arrow-rs/issues/3142#issuecomment-1321842532

   Like the general idea, just a couple of comments/questions:
   
   * If the extend payload is RecordBatch, what do you gain by concatenating them together? Why not just store them separately and periodically compact them? What do you gain from a single RecordBatch over sayb`Vec<RecordBatch>`?
   * Similar to the above, why is this an AppendableRecordBatch and not say AppendablePrimitiveArray, etc... This would be more flexible and avoid creating array temporaries
   * I'm not sure how support for booleans would work, unless you can only append multiples of 8
   * You need to know the maximum buffer lengths up front, as you can't realloc the buffers
   * Your comment suggests arrow2 supports this but can't see how, could you point me to it?
   
   
   One potentially simpler way to implement something similar to this would be to add a non-consuming finish method to the builders. This would entail copying the buffers, but in my experience of implementing the write path for IOx, this copy is insignificant in the grand scheme of query execution - even ignoring the heavy hitters like sorts and groups, all kernels involve copying values to an output array. This is especially true if after a certain number of rows you rotate the builders and just keep the immutable RecordBatch, thereby bounding the copy. What do you think?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-rs] avantgardnerio commented on issue #3142: AppendableRecordBatch

Posted by GitBox <gi...@apache.org>.

avantgardnerio commented on issue #3142:
URL: https://github.com/apache/arrow-rs/issues/3142#issuecomment-1320956914

   Edit: cleaner API with `.extend(other: RecordBatch)`: https://gist.github.com/avantgardnerio/48d977ea6bd28c790cfb6df09250336d


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org