Posted to jira@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2021/05/26 16:28:00 UTC

[jira] [Comment Edited] (ARROW-12873) [C++][Compute] Support tagging ExecBatches with arbitrary extra information

    [ https://issues.apache.org/jira/browse/ARROW-12873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351918#comment-17351918 ] 

Ben Kietzman edited comment on ARROW-12873 at 5/26/21, 4:27 PM:
----------------------------------------------------------------

It's worth noting that without some support for tagging, we won't be able to maintain batch order for simple filter+project workflows, which ParquetDataset currently requires. At present this is handled in dataset:: by tagging each batch with the index of its origin fragment and its index within that fragment, then stitching the batches back into a table that preserves the original batch order. If we don't provide a way to propagate such tags through an ExecPlan, there will be no way to reconstruct an ordered table.
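
To make the stitching step concrete, here is a minimal sketch of reordering batches by their (fragment index, batch index) tags. It is an illustration only, not the code in dataset::. The names TaggedBatch and StitchInOrder are hypothetical, and a std::string stands in for the real batch payload.

```cpp
#include <algorithm>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for a batch carrying its ordering tags.
struct TaggedBatch {
  int fragment_index;   // index of the fragment the batch was scanned from
  int batch_index;      // index of the batch within that fragment
  std::string payload;  // placeholder for the actual batch data
};

// Restore the original order of batches that arrived in arbitrary order by
// sorting on (fragment_index, batch_index), then return the payloads.
inline std::vector<std::string> StitchInOrder(std::vector<TaggedBatch> batches) {
  std::sort(batches.begin(), batches.end(),
            [](const TaggedBatch& a, const TaggedBatch& b) {
              return std::tie(a.fragment_index, a.batch_index) <
                     std::tie(b.fragment_index, b.batch_index);
            });
  std::vector<std::string> ordered;
  ordered.reserve(batches.size());
  for (auto& batch : batches) ordered.push_back(std::move(batch.payload));
  return ordered;
}
```

If the tags are dropped anywhere between the scan and the sink, the sort key is gone and the original order cannot be recovered, which is the concern raised above.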


was (Author: bkietz):
It's worth noting that without some support for tagging we won't be able to maintain batch order for simple filter+project workflows, which is currently required by ParquetDataset. Currently this is handled in dataset:: by tagging batches with the indices of their origin fragment and their index within that fragment, then stitching batches back into a table which maintains the original batch order.

> [C++][Compute] Support tagging ExecBatches with arbitrary extra information
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-12873
>                 URL: https://issues.apache.org/jira/browse/ARROW-12873
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>
> Ideally, ExecBatches could be tagged with arbitrary optional objects for tracing purposes and to transmit execution hints from one ExecNode to another.
> These should *not* be explicit members like ExecBatch::selection_vector is, since they may not originate from the arrow library. For an example within the arrow project: {{libarrow_dataset}} will be used to produce ScanNodes and WriteNodes, and it's useful to tag scanned batches with their {{Fragment}} of origin. However, adding {{ExecBatch::fragment}} would result in a cyclic dependency.
> To facilitate this tagging capability, we would need a type-erased container, something like
> {code}
> struct AnySet {
>   void* Get(tag_t tag);
>   void Set(tag_t tag, void* value, FnOnce<void(void*)> destructor);
> };
> {code}
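
One way the interface above could be fleshed out is sketched below. This is a guess at a possible shape, not the design adopted in Arrow. It assumes tag_t is a std::string (the issue does not define it) and substitutes std::function for FnOnce so the sketch compiles without any Arrow headers.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Assumption: the issue leaves tag_t undefined; a string key is used here.
using tag_t = std::string;

// Type-erased container: each tag maps to an opaque pointer plus the
// destructor that knows how to release it.
class AnySet {
 public:
  AnySet() = default;
  AnySet(const AnySet&) = delete;  // values are owned, so copying is disabled
  AnySet& operator=(const AnySet&) = delete;

  ~AnySet() {
    for (auto& entry : entries_) entry.second.Destroy();
  }

  // Returns the value stored under `tag`, or nullptr if there is none.
  void* Get(const tag_t& tag) const {
    auto it = entries_.find(tag);
    return it == entries_.end() ? nullptr : it->second.value;
  }

  // Stores `value` under `tag`, destroying any value previously stored there.
  void Set(const tag_t& tag, void* value,
           std::function<void(void*)> destructor) {
    auto it = entries_.find(tag);
    if (it != entries_.end()) {
      it->second.Destroy();
      entries_.erase(it);
    }
    entries_.emplace(tag, Entry{value, std::move(destructor)});
  }

 private:
  struct Entry {
    void* value;
    std::function<void(void*)> destructor;
    void Destroy() {
      if (destructor) destructor(value);
    }
  };

  std::unordered_map<tag_t, Entry> entries_;
};
```

Because the container only sees void* and a destructor callback, a library such as libarrow_dataset could attach its own types to a batch without the core compute library depending on them. The caller is responsible for casting the result of Get back to the type it stored.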


