You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Vibhatha Lakmal Abeykoon (Jira)" <ji...@apache.org> on 2022/07/22 10:31:00 UTC

[jira] [Commented] (ARROW-17183) [C++] Adding ExecNode with Sort and Fetch capability

    [ https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569935#comment-17569935 ] 

Vibhatha Lakmal Abeykoon commented on ARROW-17183:
--------------------------------------------------

Specifically about this case, we assume that this Fetch and Sort operations are the most external relations in a Substrait plan. Meaning the Sort or Fetch operation is called at the end of the other operations. This is not a very accurate representation. First we need to understand if this is the general case. cc [~westonpace] [~jvanstraten] 

There can be a few settings where these two operations are applied. 

*Sort Only*

If the query only has a Sorting operation instead of adding the `SinkNodeConsumer`, we need to add a `OrderBySinkNodeConsumer`. 

*Fetch Only*

If the query only has a Fetch operation, we can include a node with fetch capability. At the moment we don't have a node with Fetch capability, so this may need to be included where we could be able to use a siimilar logic used in `SelectK` node.

*SortAndFetch or FetchAndSort*

If the query contains both sort and fetch in a given order, there has to be a single Sink node which can do this operation by the given order. When scanning the plan components, when we find a sort we just add a `OrderBySink` and keep adding other relations. If we find a Fetch operation, this needs to be replaced with a SortAndFetch operation where sorting is done first and fetching is done next. And this can go vice-versa. 

*Another Approach:*

Another approach is that we define a sink node which can execute a function which does the expected operation. In some of the defined Sink nodes (KSelect, Sort) there is a function called {*}`{*}DoFinish{*}`.{*} We should be able to call a custom function within this call. So from Substrait end when we extract the plan, then we can write the required `std::function` which would be an option for this custom sink node. And we assume that a table as input and write the logic. This way we don't have to introduce new nodes. And what if there are different capabilities users need and ACERO has a limitation, can we always keep adding nodes to fullfil that? I am not so sure. This is just a high level thought process.

Although I have implemented a _SortAndFetch_ node which can perform the fetch followed by a sort just by following what is being done in Sort and SelectK nodes. But I am not exactly sure any of these approaches are optimized or the best way to solve the problem.  

This is the doubtful component which I am not quite clear what would be the most optimize way to include this capability. Or if there is a better way to do this. Appreciate your thoughts. 

cc [~westonpace] [~jvanstraten] [~bkietz] [~icook] 

> [C++] Adding ExecNode with Sort and Fetch capability
> ----------------------------------------------------
>
>                 Key: ARROW-17183
>                 URL: https://issues.apache.org/jira/browse/ARROW-17183
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Vibhatha Lakmal Abeykoon
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>
> In Substrait integrations with ACERO, a functionality required is the ability to fetch records sorted and unsorted.
> Fetch operation is defined as selecting `K` number of records with an offset. For instance pick 10 records skipping the first 5 elements. Here we can define this as a Slice operation and records can be easily extracted in a sink-node. 
> Sort and Fetch operation applies when we need to execute a Fetch operation on sorted data. The main issue is we cannot have a sort node followed by a fetch. The reason is that all existing node definitions supporting sort are based on sink nodes. Since there cannot be a node followed by sink, this functionality has to take place in a single node. 
> But this is not a perfect solution for fetch and sort, but one way to do this is define a sink node where the records are sorted and then a set of items are fetched. 
> Another dilema is what if sort is followed by a fetch. In that case, there has to be a flag to enable the order of the operations. 
> The objective of this ticket is to discuss a viable efficient solution and include new nodes or a method to execute such a logic.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)