Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/01/14 01:58:00 UTC

[jira] [Created] (ARROW-15333) [C++] Specifying all fields as included_fields is slower on an mmap file than omitting included_fields

Weston Pace created ARROW-15333:
-----------------------------------

             Summary: [C++] Specifying all fields as included_fields is slower on an mmap file than omitting included_fields
                 Key: ARROW-15333
                 URL: https://issues.apache.org/jira/browse/ARROW-15333
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


This turned up when I was working on the IPC read/write benchmarks.

Given an mmap'd IPC file with 1 column, these two operations should perform identically:

{code}
ipc::IpcReadOptions options;
options.included_fields = {0};
auto reader = *ipc::RecordBatchFileReader::Open(input.get(), options);
{code}

{code}
ipc::IpcReadOptions options;
options.included_fields = {};
auto reader = *ipc::RecordBatchFileReader::Open(input.get(), options);
{code}

However, the latter is ~10 times faster than the former.  We should detect when all of the fields are specified and fall back to the "load all fields" behavior.
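
A minimal sketch of the proposed detection, not the actual Arrow implementation. The helper names {{SelectsAllFields}} and {{NormalizeIncludedFields}} are hypothetical, and the sketch assumes that {{included_fields}} holds top-level field indices, that an empty vector means "load all fields", and that ordering and duplicates in the list do not affect the result:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Arrow): returns true when `included_fields`
// names every top-level field index in [0, num_fields) at least once.
bool SelectsAllFields(const std::vector<int>& included_fields, int num_fields) {
  if (num_fields <= 0) return false;
  std::vector<bool> seen(static_cast<std::size_t>(num_fields), false);
  int distinct = 0;
  for (int idx : included_fields) {
    // Out-of-range indices are left for the regular code path to handle.
    if (idx < 0 || idx >= num_fields) return false;
    if (!seen[static_cast<std::size_t>(idx)]) {
      seen[static_cast<std::size_t>(idx)] = true;
      ++distinct;
    }
  }
  return distinct == num_fields;
}

// Hypothetical normalization step: if the selection covers every field,
// return an empty vector so the reader can take the "load all fields" path.
std::vector<int> NormalizeIncludedFields(std::vector<int> included_fields,
                                         int num_fields) {
  if (SelectsAllFields(included_fields, num_fields)) {
    included_fields.clear();
  }
  return included_fields;
}
```

Under these assumptions, a reader could apply the normalization once when it is opened, after the schema (and therefore the field count) is known, so that {{included_fields = {0}}} on a one-column file behaves the same as leaving the option empty.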

Benchmark results:

{noformat}
Benchmark                                                               Time             CPU   Iterations UserCounters...
ReadMmapCachedFile/num_cols:1/is_partial:0/real_time               125726 ns       125677 ns         4359 bytes_per_second=124.278G/s
ReadMmapCachedFile/num_cols:1/is_partial:1/real_time              1404416 ns      1403848 ns          429 bytes_per_second=11.1256G/s
{noformat}


