You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/01 15:34:51 UTC

[GitHub] [arrow] pitrou commented on a change in pull request #9656: ARROW-11772: [C++] Provide reentrant IPC file reader

pitrou commented on a change in pull request #9656:
URL: https://github.com/apache/arrow/pull/9656#discussion_r605753277



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -1022,7 +1097,29 @@ class RecordBatchFileReaderImpl : public RecordBatchFileReader {
 
   ReadStats stats() const override { return stats_; }
 
+  Result<AsyncGenerator<std::shared_ptr<RecordBatch>>> GetRecordBatchGenerator(
+      int readahead_messages, const io::IOContext& io_context,
+      arrow::internal::Executor* executor) override {
+    if (readahead_messages < 1) {
+      return Status::Invalid("readahead_messages must be greater than 0");
+    }
+    auto state = std::dynamic_pointer_cast<RecordBatchFileReaderImpl>(shared_from_this());
+    auto messages = MakeMessageIterator(state);
+    auto q_restart = std::max(1, readahead_messages / 2);
+    ARROW_ASSIGN_OR_RAISE(
+        auto message_generator,
+        MakeBackgroundGenerator(std::move(messages), io_context.executor(),
+                                readahead_messages, q_restart));

Review comment:
       Ok, I don't think this is the right approach. Basically you are taking a blocking message reader (`ReadMessageFromBlock`) and pushing it to a thread pool to hide latencies. But when reading a message from a mmap'ed file, you are adding the thread pool overhead to a very small execution cost.
   
   Instead, there should be a `ReadMessageFromBlockAsync` that returns a `Future<Message>`. It's up to the underlying IO object (memory mapped file, S3 object...) to decide whether the async read goes to a thread pool or is synchronous.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org