Posted to jira@arrow.apache.org by "Apache Arrow JIRA Bot (Jira)" <ji...@apache.org> on 2022/10/13 17:52:00 UTC

[jira] [Assigned] (ARROW-16409) [C++][Python][R] Deprecate "scanner" (but keep "scan node") from public API

     [ https://issues.apache.org/jira/browse/ARROW-16409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Arrow JIRA Bot reassigned ARROW-16409:
---------------------------------------------

    Assignee:     (was: Weston Pace)

> [C++][Python][R] Deprecate "scanner" (but keep "scan node") from public API
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-16409
>                 URL: https://issues.apache.org/jira/browse/ARROW-16409
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> The scanner, in its original form, was something of a prototype query engine.  It handled complex projection (beyond just casting) and filtering.  Over time, features have been moved out of the scanner and into the execution engine, to the point that the scanner is now just a tool for scanning multiple files simultaneously to feed an exec plan as input (i.e. the "scan node").
> The concept of a "scanner" should mostly be removed from our public API surface.  Those working directly with the execution engine will still need to know about the scan node but that should be about it.
> For example, in Python we have pages [like this|https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Scanner.html] and code like this:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> scanner = dataset.scanner(columns=['x'])
> ds.write_dataset(scanner, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> Over time, I think this will lead to confusion.  It's already a little convoluted.  For example, a call to {{dataset.to_table(...)}} creates a {{Scanner}} and calls {{ToTable}} with {{ScanOptions}}.  This method then creates an {{ExecPlan}} and, in order to do so, must create a {{ScanNode}}.  The {{ScanNode}} consumes some (but not all) of the options in {{ScanOptions}} while the {{ExecPlan}} consumes the rest.
> The {{Scanner}} (if one continues to exist) should be an internal detail not visible to users.  The previous code could either change to use a new term {{query}}:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> query = dataset.query(columns=['x'])
> ds.write_dataset(query, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> Or we could use the record batch reader concept:
> {noformat}
> dataset = ds.dataset('/tmp/my_dataset', format='parquet')
> record_batch_reader = dataset.to_reader(columns=['x'])
> ds.write_dataset(record_batch_reader, '/tmp/my_new_dataset', format='parquet')
> {noformat}
> I would like to make some changes to the scanner in 9.0.0 and would hope to address this then, so I'm happy to hear opinions / thoughts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)