You are viewing a plain text version of this content. The canonical link for it is here.

Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/09/30 20:16:00 UTC

[jira] [Created] (ARROW-14186) [C++][Dataset] Define appropriate abstractions for "fragments" that can handle compute

Weston Pace created ARROW-14186:
-----------------------------------

Summary: [C++][Dataset] Define appropriate abstractions for "fragments" that can handle compute
Key: ARROW-14186
URL: https://issues.apache.org/jira/browse/ARROW-14186
Project: Apache Arrow
Issue Type: Wish
Components: C++
Reporter: Weston Pace

This issue has come up in flight (ARROW-10524) and Skyhook (ARROW-13607). In both cases there is a desire to scan data from remote data sources. In both cases the remote data sources can be capable of essentially running their own query engine. I went ahead and created a JIRA to capture some of the discussion.

So maybe this is a question of "how does the datasets API handle distributed query?" which is maybe a subquestion of "what is the future of the datasets API given richer query frontends?"

If we treat datasets API as a simple query engine frontend limited to scan->filter->project->collect|head|count graphs then filtering can be pushed down (and returned with a guarantee) and projection probably can't be pushed down if there are multiple data sources. Head can be pushed down but not count without some effort.

If we're thinking of the datasets API as a scan node for a more general query engine then I think things get complex rather quickly. I'm not sure if the above rules apply or not. For example, a join might combine data from two different source. A filter that compares columns on both sides of the join could not be pushed down. I'm sure these problems are figured out by more general purpose distributed query engines (which presumably slice the query plan into smaller query plans for each individual node).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)