Posted to github@arrow.apache.org by "martin-traverse (via GitHub)" <gi...@apache.org> on 2023/03/24 19:14:37 UTC

[GitHub] [arrow] martin-traverse commented on issue #24575: [C++] Implement row range read API for IPC file (and Feather)

martin-traverse commented on issue #24575:
URL: https://github.com/apache/arrow/issues/24575#issuecomment-1483293086

Hello - just wondering if anyone is still thinking about this? We have a data platform where this would be extremely useful.

Looking at the Footer structure, the block information is already there as a list. Since the footer has to be read anyway, would it be sufficient to add e.g. a recordOffset property to the Block, meaningful only for record batches? I would think a simple solution like this would allow paginated retrieval and processing of moderately large datasets, at least up to a few million batches. Compared to scanning through the batches one by one from cloud storage, it would be a big win.
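
To make the idea concrete, here is a minimal Python sketch of what such a Block could carry. The offset, metadata length and body length mirror the fields the footer Block already has; record_offset is the hypothetical addition proposed here and is not part of the current file format. The helper name build_blocks and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass
from itertools import accumulate
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Block:
    offset: int           # byte offset of the message in the file
    metadata_length: int  # length of the message metadata in bytes
    body_length: int      # length of the message body in bytes
    record_offset: int    # HYPOTHETICAL: index of the first row in this batch


def build_blocks(
    layout: Sequence[Tuple[int, int, int]],
    row_counts: Sequence[int],
) -> List[Block]:
    """Attach a running row offset to each (offset, metadata_length, body_length)."""
    if len(layout) != len(row_counts):
        raise ValueError("layout and row_counts must have the same length")
    # Running total of rows before each batch: [0, n0, n0 + n1, ...]
    starts = list(accumulate(row_counts, initial=0))[:-1]
    return [Block(o, m, b, s) for (o, m, b), s in zip(layout, starts)]


blocks = build_blocks(
    [(8, 256, 4096), (4360, 256, 2048), (6664, 256, 8192)],
    [100, 50, 200],
)
```

A writer would only need to keep a running row count while emitting batches, so the extra cost at write time is one integer per block.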

As for the language APIs: we are already working at the FB / batch level because we need non-blocking data streams, so having this in the file format alone would be enough for us. My guess is that adding it to the language APIs would make it more generally useful, though.

For the pre-built index discussed above, I'd think this is only needed if (a) the number of batches is very large, (b) the arrangement of batches is very asymmetrical (e.g. lots of big batches followed by lots of small batches), and (c) the file is read a lot more often than it is written. Perhaps this index structure could be added later, with readers falling back to either bisecting the list of blocks or generating an index at read time when it is absent. It depends, I guess, on how much extra effort is needed.
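
The bisecting fallback could look roughly like the Python sketch below. It assumes the per-batch starting row offsets (the hypothetical recordOffset values) are available as a sorted list; the function names are illustrative only and do not correspond to any existing Arrow API.

```python
from bisect import bisect_right
from typing import Sequence, Tuple


def find_batch(record_offsets: Sequence[int], row: int) -> int:
    """Return the index of the batch whose row range contains `row`.

    `record_offsets[i]` is assumed to be the index of the first row of
    batch i. Rows past the end of the file map to the last batch, so the
    caller has to check `row` against the total row count separately.
    """
    if row < 0 or not record_offsets:
        raise IndexError("row out of range")
    return bisect_right(record_offsets, row) - 1


def batches_for_range(
    record_offsets: Sequence[int], start: int, stop: int
) -> Tuple[int, int]:
    """Return (first, last) batch indices covering rows [start, stop)."""
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    return find_batch(record_offsets, start), find_batch(record_offsets, stop - 1)


offsets = [0, 100, 150]  # three batches holding 100, 50 and N rows
first, last = batches_for_range(offsets, 90, 160)
```

Each lookup costs O(log n) comparisons on a list that is already in memory once the footer has been read, which is why a separate pre-built index may only pay off for very large block lists.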

We have a very crude solution for now: we use a constant batch size and write that value to the custom metadata in the footer for datasets created by our platform. This is really all the functionality we need; the only problem is that it's not portable.
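
With a constant batch size, mapping a row range to batches is plain integer arithmetic. The Python sketch below shows one way to do it, assuming every batch except possibly the last holds exactly batch_size rows; batch_slices is an illustrative name, not something from our platform or from Arrow.

```python
from typing import Iterator, Tuple


def batch_slices(
    start: int, stop: int, batch_size: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield (batch_index, lo, hi) so that rows [lo, hi) of each batch
    together cover the global row range [start, stop)."""
    if batch_size <= 0 or not 0 <= start <= stop:
        raise ValueError("expected batch_size > 0 and 0 <= start <= stop")
    row = start
    while row < stop:
        # Which batch holds `row`, and where inside that batch it sits.
        index, lo = divmod(row, batch_size)
        # Take rows up to the end of this batch or the end of the range.
        hi = min(batch_size, lo + (stop - row))
        yield index, lo, hi
        row += hi - lo


slices = list(batch_slices(2500, 4200, 1000))
```

Each tuple names one batch to fetch and the slice to take from it. In pyarrow, for example, this could be paired with RecordBatchFileReader.get_batch(index) and RecordBatch.slice(lo, hi - lo).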
