Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/01/14 01:58:00 UTC
[jira] [Created] (ARROW-15333) [C++] Specifying all fields as included_fields is slower on an mmap file than omitting included_fields
Weston Pace created ARROW-15333:
-----------------------------------
Summary: [C++] Specifying all fields as included_fields is slower on an mmap file than omitting included_fields
Key: ARROW-15333
URL: https://issues.apache.org/jira/browse/ARROW-15333
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
This turned up when I was working on the IPC read/write benchmarks.
Given an mmap'd IPC file with one column, these two operations should perform identically:
{code}
ipc::IpcReadOptions options;
options.included_fields = {0};
auto reader = *ipc::RecordBatchFileReader::Open(input.get(), options);
{code}
{code}
ipc::IpcReadOptions options;
options.included_fields = {};
auto reader = *ipc::RecordBatchFileReader::Open(input.get(), options);
{code}
However, the latter is ~10 times faster than the former. We should detect when included_fields names every field and fall back to the "load all fields" behavior.
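A rough sketch of what that detection might look like, independent of the actual reader code. The helper name SelectsAllFields and the num_fields parameter are hypothetical and not part of the Arrow API. It assumes included_fields holds top-level field indices and that an empty list means "all fields", as in the second snippet above. Out-of-range indices make it return false, so the existing selective path would still get to handle them.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not an Arrow function): returns true when
// `included_fields` selects every top-level field of a schema that has
// `num_fields` fields, so a reader could take the "load all fields" path.
bool SelectsAllFields(const std::vector<int>& included_fields, int num_fields) {
  // An empty selection already means "all fields".
  if (included_fields.empty()) return true;
  if (num_fields < 0) return false;

  std::vector<bool> seen(static_cast<std::size_t>(num_fields), false);
  int distinct = 0;
  for (int index : included_fields) {
    // Be conservative on invalid input: leave it to the selective path.
    if (index < 0 || index >= num_fields) return false;
    if (!seen[static_cast<std::size_t>(index)]) {
      seen[static_cast<std::size_t>(index)] = true;
      ++distinct;
    }
  }
  // Duplicates are tolerated; only the set of distinct indices matters.
  return distinct == num_fields;
}
```

With a check like this, a reader could clear the selection (or skip the per-field filtering) whenever it returns true, which would presumably make the two snippets above take the same code path.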
Benchmark results:
{noformat}
ReadMmapCachedFile/num_cols:1/is_partial:0/real_time 125726 ns 125677 ns 4359 bytes_per_second=124.278G/s
ReadMmapCachedFile/num_cols:1/is_partial:1/real_time 1404416 ns 1403848 ns 429 bytes_per_second=11.1256G/s
{noformat}