Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2022/10/13 15:19:00 UTC

[jira] [Created] (ARROW-18037) [C++] Acero/dataset relies on ExecBatch::ToRecordBatch truncating excess columns

Antoine Pitrou created ARROW-18037:
--------------------------------------

             Summary: [C++] Acero/dataset relies on ExecBatch::ToRecordBatch truncating excess columns
                 Key: ARROW-18037
                 URL: https://issues.apache.org/jira/browse/ARROW-18037
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Antoine Pitrou


As found while working on ARROW-18004: the dataset scanner and the Acero engine rely on {{ExecBatch::ToRecordBatch}} returning successfully when the given schema has fewer fields than the ExecBatch has columns.

This apparently allows the dataset-added columns ({{kAugmentedFields}} in {{arrow/dataset/scanner.cc}}) to be dropped implicitly from a scan's final result.

However, doing this implicitly at the {{ExecBatch::ToRecordBatch}} level seems wrong and brittle, since it can hide genuine column-count mismatches. Instead, the extra columns should probably be dropped explicitly inside Acero/dataset.
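
To make the difference concrete, here is a minimal sketch using simplified stand-in types rather than the real Arrow classes. {{ToyBatch}}, {{ToyTable}}, {{ToTableLenient}}, {{ToTableStrict}} and {{DropTrailingColumns}} are all hypothetical names introduced for illustration only. The sketch also assumes that the augmented columns sit at the end of the batch, which is what "truncating" suggests but which is not verified here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-ins for illustration; these are NOT the Arrow types.
struct ToyBatch {
  std::vector<std::vector<int>> columns;
};

struct ToyTable {
  std::vector<std::string> names;
  std::vector<std::vector<int>> columns;
};

// Lenient conversion: columns beyond the schema are silently ignored.
// This mirrors the behaviour the issue describes, and it means a caller
// that passes the wrong schema by mistake gets no error.
bool ToTableLenient(const ToyBatch& batch,
                    const std::vector<std::string>& schema, ToyTable* out) {
  if (batch.columns.size() < schema.size()) return false;
  out->names = schema;
  out->columns.assign(
      batch.columns.begin(),
      batch.columns.begin() + static_cast<std::ptrdiff_t>(schema.size()));
  return true;
}

// Strict conversion: the column count must match the schema exactly.
bool ToTableStrict(const ToyBatch& batch,
                   const std::vector<std::string>& schema, ToyTable* out) {
  if (batch.columns.size() != schema.size()) return false;
  out->names = schema;
  out->columns = batch.columns;
  return true;
}

// Explicit projection done by the caller (here: the scanner side), which
// knows how many trailing augmented columns it appended.
ToyBatch DropTrailingColumns(const ToyBatch& batch, std::size_t num_to_drop) {
  assert(num_to_drop <= batch.columns.size());
  ToyBatch out;
  out.columns.assign(
      batch.columns.begin(),
      batch.columns.end() - static_cast<std::ptrdiff_t>(num_to_drop));
  return out;
}
```

With these definitions, a caller would first call {{DropTrailingColumns}} with the number of augmented columns it added and only then call {{ToTableStrict}}. Any remaining mismatch between batch and schema is then reported as a failure instead of being silently truncated.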



