Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/04 19:07:16 UTC

[GitHub] [arrow] westonpace commented on pull request #10070: ARROW-12231: [C++][Python][Dataset] Differentiate one-shot datasets

westonpace commented on pull request #10070:
URL: https://github.com/apache/arrow/pull/10070#issuecomment-832175816


   > To me, this seems less like a subclass of dataset and more like a subclass of Scanner: IMHO it's not intuitive that a dataset would ever be single-shot. Instead, I think it'd make more sense to add Scanner::MakeFromRecordBatchReader or so, and (probably) add single-shot-ness to the contract of Scanner.
   
   I'm not sure I agree.  I agree with "it's not intuitive that a dataset would ever be single-shot".  I don't agree that it makes any more sense for Scanner to be single-shot.  I think the core non-intuitive piece is the concept of a "one-shot iterable".
   
   In my mental model:
   
   ```
   Dataset -> Iterable<Fragment>
   Fragment -> Map<Fragment, Iterable<RecordBatch>>
   Scanner -> Map<Dataset, Iterable<RecordBatch>>
   ```
   
   So Scanner is just a "map" function which is generally (Python being the exception) reusable.
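
To make the mental model above concrete, here is a minimal Python sketch. The name `scan` and the use of plain lists in place of fragments and record batches are illustrative assumptions, not Arrow APIs. It shows the same map-style function being reusable over a re-iterable dataset, yet silently producing nothing the second time when the dataset is backed by a one-shot iterator.

```python
from typing import Iterable, Iterator, List

# Hypothetical stand-ins: a "batch" is a list of ints,
# and a "fragment" is a list of batches.
Batch = List[int]
Fragment = List[Batch]


def scan(dataset: Iterable[Fragment]) -> Iterator[Batch]:
    """Map a dataset (iterable of fragments) to a flat stream of batches."""
    for fragment in dataset:
        for batch in fragment:
            yield batch


fragments: List[Fragment] = [[[1, 2], [3]], [[4, 5, 6]]]

# A list is re-iterable, so scanning it twice gives the same batches.
reusable_first = list(scan(fragments))
reusable_second = list(scan(fragments))

# An iterator is one-shot: the second scan sees an exhausted source.
one_shot = iter(fragments)
one_shot_first = list(scan(one_shot))
one_shot_second = list(scan(one_shot))
```

In this sketch `scan` itself never changes; only the iterable handed to it decides whether a second pass is possible, which is the sense in which the "one-shot iterable" is the non-intuitive piece.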
   
   Perhaps I will revisit my original suggestion of having the input to the dataset be an iterable (`InMemoryDataset::RecordBatchGenerator` is already sort of an "iterable" interface) and the in-memory variants are one-shot iterables.  The user-facing Python API could remain as-is.  A list of batches or tables, or an iterable of batches or tables, would be converted into a `RecordBatchReader`, and a one-shot implementation of `InMemoryDataset::RecordBatchGenerator` would consume the reader and then return an invalid status the next time `InMemoryDataset::RecordBatchGenerator::Get` is called.
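
A rough Python sketch of that one-shot behavior follows. The class `OneShotBatchGenerator` and its `get` method are hypothetical stand-ins for a one-shot `InMemoryDataset::RecordBatchGenerator`, and raising `RuntimeError` stands in for returning an invalid status. This is an assumption about how it could look, not the actual implementation in the PR.

```python
from typing import Iterable, Iterator, List


class OneShotBatchGenerator:
    """Hypothetical analogue of a one-shot RecordBatchGenerator."""

    def __init__(self, reader: Iterable[List[int]]) -> None:
        # Wrap the source in an iterator; it can only be walked once.
        self._reader = iter(reader)
        self._consumed = False

    def get(self) -> Iterator[List[int]]:
        # Fail loudly on reuse instead of silently yielding nothing.
        # Note: the generator is marked consumed as soon as get() is
        # called, even if the caller never drains the iterator.
        if self._consumed:
            raise RuntimeError("one-shot generator was already consumed")
        self._consumed = True
        return self._reader


generator = OneShotBatchGenerator([[1, 2], [3]])
first_pass = list(generator.get())

try:
    generator.get()
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
```

The point of the explicit error is that a second scan reports misuse rather than quietly returning an empty result.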
   
   Although that takes us back pretty close to where we started :grimacing: 

