Posted to jira@arrow.apache.org by "Ben Kietzman (Jira)" <ji...@apache.org> on 2021/06/01 16:21:00 UTC

[jira] [Commented] (ARROW-12873) [C++][Compute] Support tagging ExecBatches with arbitrary extra information

    [ https://issues.apache.org/jira/browse/ARROW-12873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17355203#comment-17355203 ] 

Ben Kietzman commented on ARROW-12873:
--------------------------------------

> You mention "since they may not originate from the arrow library".  Do you have an example of that?

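{{libarrow_dataset}} is currently a separate library from {{libarrow}}, so an ExecBatch member of a dataset type such as {{Fragment}} would introduce a cyclic dependency.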

> It seems pretty straightforward that an exec batch would have a partition expression associated with it.

That seems reasonable, and since Expression is in the compute:: namespace these days we can attach {{Expression ExecBatch::guarantee}} without much trouble.

> One thing I am wondering about, which is maybe a bit academic, is what are the rules of preserving tags that operators (ExecNodes) must follow.

That's an excellent point; it's not at all clear what a node should do with batch-level tags if it doesn't have a straightforward 1-in-1-out process. For now, in https://github.com/apache/arrow/pull/10397, I'm pursuing a workaround: augmenting scanned batches with virtual columns which encode their fragment and batch indices, which can then be used to reorder batches in a ToTable operation.
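
To illustrate the reordering idea only, here is a minimal sketch that does not use Arrow at all. {{TaggedBatch}} and {{RestoreScanOrder}} are hypothetical names, and the two index fields stand in for the virtual columns; the real change in the PR may look quite different.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a scanned batch; this is not an Arrow type.
struct TaggedBatch {
  int fragment_index;   // stand-in for a virtual column: originating fragment
  int batch_index;      // stand-in for a virtual column: position within the fragment
  std::string payload;  // placeholder for the real column data
};

// Restore scan order for batches that arrived in arbitrary order by sorting
// lexicographically on (fragment_index, batch_index).
std::vector<TaggedBatch> RestoreScanOrder(std::vector<TaggedBatch> batches) {
  std::stable_sort(batches.begin(), batches.end(),
                   [](const TaggedBatch& l, const TaggedBatch& r) {
                     return std::tie(l.fragment_index, l.batch_index) <
                            std::tie(r.fragment_index, r.batch_index);
                   });
  return batches;
}
```

Because the indices travel with the data as ordinary columns, the sink can recover the order without any node in between needing to know about tags.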

Perhaps something similar could be more generally applicable to the problem of tagging batches: users who need to tag batches can augment them with virtual columns containing keys into a table which they maintain.
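
As a rough sketch of that suggestion, again without Arrow: all names below ({{Batch}}, {{TagTable}}, {{Augment}}, {{KeepPositive}}, {{Concat}}) are hypothetical. The key column is treated like any other column, so it survives operations which are not 1-in-1-out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a batch; this is not an Arrow type.
struct Batch {
  std::vector<int64_t> values;    // an ordinary data column
  std::vector<uint32_t> tag_key;  // virtual column: one key per row
};

// User-maintained table mapping keys to arbitrary tag objects.
template <typename Tag>
class TagTable {
 public:
  // Store a tag and return the key that refers to it.
  uint32_t Add(Tag tag) {
    tags_.push_back(std::move(tag));
    return static_cast<uint32_t>(tags_.size() - 1);
  }
  // Look up a tag by key; throws std::out_of_range for an unknown key.
  const Tag& Lookup(uint32_t key) const { return tags_.at(key); }

 private:
  std::vector<Tag> tags_;
};

// Attach a constant key column to freshly produced values.
Batch Augment(std::vector<int64_t> values, uint32_t key) {
  Batch b;
  b.tag_key.assign(values.size(), key);
  b.values = std::move(values);
  return b;
}

// Merge two batches; each row keeps its own key.
Batch Concat(const Batch& a, const Batch& b) {
  Batch out = a;
  out.values.insert(out.values.end(), b.values.begin(), b.values.end());
  out.tag_key.insert(out.tag_key.end(), b.tag_key.begin(), b.tag_key.end());
  return out;
}

// Row-wise filter; keys are filtered together with the values.
Batch KeepPositive(const Batch& in) {
  assert(in.values.size() == in.tag_key.size());
  Batch out;
  for (std::size_t i = 0; i < in.values.size(); ++i) {
    if (in.values[i] > 0) {
      out.values.push_back(in.values[i]);
      out.tag_key.push_back(in.tag_key[i]);
    }
  }
  return out;
}
```

The cost of this approach is one extra column per batch, but no node has to define what "preserving a tag" means: merging and filtering carry the keys along like any other data.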

> [C++][Compute] Support tagging ExecBatches with arbitrary extra information
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-12873
>                 URL: https://issues.apache.org/jira/browse/ARROW-12873
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>
> Ideally, ExecBatches could be tagged with arbitrary optional objects for tracing purposes and to transmit execution hints from one ExecNode to another.
> These should *not* be explicit members like ExecBatch::selection_vector is, since they may not originate from the arrow library. For an example within the arrow project: {{libarrow_dataset}} will be used to produce ScanNodes and WriteNodes, and it's useful to tag scanned batches with their {{Fragment}} of origin. However, adding {{ExecBatch::fragment}} would result in a cyclic dependency.
> To facilitate this tagging capability, we would need a type-erased container, something like
> {code}
> struct AnySet {
>   void* Get(tag_t tag);
>   void Set(tag_t tag, void* value, FnOnce<void(void*)> destructor);
> };
> {code}
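
For reference, one way the {{AnySet}} interface quoted above could be filled in is sketched below. This is only an illustration under stated assumptions: {{tag_t}} is taken to be an opaque pointer key, {{std::function}} stands in for {{FnOnce}} so that the sketch is self-contained, and the set is made non-copyable to keep ownership simple. None of this is confirmed by the issue.

```cpp
#include <cassert>
#include <functional>
#include <unordered_map>
#include <utility>

// Assumption: a tag is an opaque key, here the address of some static object.
using tag_t = const void*;

class AnySet {
 public:
  AnySet() = default;
  // Non-copyable: each stored value must be destroyed exactly once.
  AnySet(const AnySet&) = delete;
  AnySet& operator=(const AnySet&) = delete;

  // Run every remaining destructor when the set goes away.
  ~AnySet() {
    for (auto& kv : entries_) kv.second.destructor(kv.second.value);
  }

  // Returns the stored pointer, or nullptr if nothing is set for this tag.
  void* Get(tag_t tag) const {
    auto it = entries_.find(tag);
    return it == entries_.end() ? nullptr : it->second.value;
  }

  // Stores value under tag, destroying any value previously stored there.
  // std::function is a stand-in for FnOnce in this sketch.
  void Set(tag_t tag, void* value, std::function<void(void*)> destructor) {
    assert(destructor && "a destructor callback is required");
    auto it = entries_.find(tag);
    if (it != entries_.end()) {
      it->second.destructor(it->second.value);
      entries_.erase(it);
    }
    entries_.emplace(tag, Entry{value, std::move(destructor)});
  }

 private:
  struct Entry {
    void* value;
    std::function<void(void*)> destructor;
  };
  std::unordered_map<tag_t, Entry> entries_;
};
```

A caller would typically use the address of a private static variable as the tag, so that independent libraries cannot collide on keys, and would cast the result of {{Get}} back to the type it stored.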