Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/01/14 21:00:00 UTC

[jira] [Commented] (ARROW-13340) [C++][Dataset] Simplify ScanOptions after complexity has moved to ScanNode

    [ https://issues.apache.org/jira/browse/ARROW-13340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476424#comment-17476424 ] 

Weston Pace commented on ARROW-13340:
-------------------------------------

I think this is a good idea (some of the discussion in ARROW-12311 is related).

One challenge is that ScanOptions is also used by the "convenience methods" such as Scanner::Scan and Scanner::Head. These methods construct an ExecPlan, so they need to take in a projection. (The fact that they construct a scanner is incidental, so it's probably not quite correct for them to be Scanner methods.)

One way to tackle that problem would be to create QueryOptions, which would hold ScanOptions plus a projection (and maybe other things down the road). Another way would be to modify Scanner::Scan and Scanner::Head to take the projection as an optional argument.
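
As a rough sketch of those two options, something like the following could work. Everything here uses stand-in types rather than the real Arrow classes: `Expression`, `ScanOptions`, `QueryOptions`, `HeadPlanFromQuery`, and `HeadPlan` are hypothetical placeholders, and the functions only describe the plan they would build instead of constructing an ExecPlan.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Stand-ins for illustration only; these are NOT the real Arrow types.
struct Expression {
  std::string repr;  // textual form of the expression, e.g. "x > 1"
};

struct ScanOptions {
  Expression filter{"true"};  // what the scan itself needs
};

// Option 1 (hypothetical): QueryOptions = ScanOptions + projection.
struct QueryOptions {
  ScanOptions scan;
  std::optional<Expression> projection;  // nullopt means "no projection"
};

// Option 2 (hypothetical): the convenience method takes the projection as an
// optional argument. Returns a description of the plan it would build.
std::string HeadPlan(const ScanOptions& scan, int64_t num_rows,
                     const std::optional<Expression>& projection = std::nullopt) {
  std::string plan = "scan(filter=" + scan.filter.repr + ")";
  if (projection) {
    plan += " -> project(" + projection->repr + ")";
  }
  plan += " -> head(" + std::to_string(num_rows) + ")";
  return plan;
}

// Option 1 can simply forward to the same plan-building logic.
std::string HeadPlanFromQuery(const QueryOptions& query, int64_t num_rows) {
  return HeadPlan(query.scan, num_rows, query.projection);
}
```

Either way, ScanOptions itself no longer needs to carry the projection; the difference is only whether callers pass it bundled in a struct or as a separate argument.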

> [C++][Dataset] Simplify ScanOptions after complexity has moved to ScanNode
> --------------------------------------------------------------------------
>
>                 Key: ARROW-13340
>                 URL: https://issues.apache.org/jira/browse/ARROW-13340
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>              Labels: dataset
>             Fix For: 8.0.0
>
>
> ScanOptions currently has a number of constraints between members, which violates the contract of a public struct:
> - {{filter}} must be bound to {{dataset_schema}}
> - {{projection}} must be bound to {{dataset_schema}}
> - {{projected_schema}} must be {{schema<...fields>}}, where the type of projection is {{struct<...fields>}}
> These are currently required to support {{FilterAndProjectScanTask}}, but after ARROW-13328 this complexity can be removed and ScanOptions can be a pure struct argument to {{MakeScanNode}}. Specifically, it should be possible to:
> - remove the {{projected_schema}} field (ScanNode doesn't need to know the schemas of any subsequent nodes)
> - remove the {{projection}} field (ScanNode doesn't need to know how or if scanned batches will be projected)
> - provide a simple vector of {{FieldRef}} to indicate which fields should be materialized (MakeScanNode can validate that this includes every field referenced by {{filter}})
> - allow {{filter}} to be unbound (MakeScanNode can bind it to the dataset schema)
> {{dataset_schema}} seems slightly redundant too, since MakeScanNode also takes a Dataset as an argument, but it is currently used by CsvFileFormat to derive column types.
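
The validation described in the issue (the materialized fields must include every field referenced by the filter) might look roughly like this. This is a minimal sketch with stand-in types, not the Arrow API: `FieldRef` is reduced to a name, `Expression` only records which fields it references, and `SimpleScanOptions` and `MissingFilterFields` are hypothetical names.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Stand-ins for illustration only; these are NOT the real Arrow types.
struct FieldRef {
  std::string name;
};

struct Expression {
  // Only the part needed here: the fields the (unbound) filter refers to.
  std::vector<FieldRef> referenced_fields;
};

// Hypothetical simplified options: no projection, no projected_schema.
struct SimpleScanOptions {
  Expression filter;
  std::vector<FieldRef> materialized_fields;
};

// Returns the names of fields that the filter references but that are not
// listed in materialized_fields. An empty result means the options are valid.
std::vector<std::string> MissingFilterFields(const SimpleScanOptions& options) {
  std::vector<std::string> missing;
  for (const FieldRef& ref : options.filter.referenced_fields) {
    const bool found = std::any_of(
        options.materialized_fields.begin(), options.materialized_fields.end(),
        [&ref](const FieldRef& field) { return field.name == ref.name; });
    if (!found) {
      missing.push_back(ref.name);
    }
  }
  return missing;
}
```

A check like this would presumably run inside MakeScanNode, before the filter is bound to the dataset schema.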


