Posted to jira@arrow.apache.org by "Dhruv Vats (Jira)" <ji...@apache.org> on 2022/03/25 08:20:00 UTC
[jira] [Commented] (ARROW-15643) [C++] Kernel to select subset of fields of a StructArray
[ https://issues.apache.org/jira/browse/ARROW-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512249#comment-17512249 ]
Dhruv Vats commented on ARROW-15643:
------------------------------------
If this has some priority, I'd be happy to work on it. But I'm not clear on which path we are settling on. Do we implement the {{struct_subset}} kernel and the casting functionality separately? Or should we implement one and use it to implement the other?
> [C++] Kernel to select subset of fields of a StructArray
> --------------------------------------------------------
>
> Key: ARROW-15643
> URL: https://issues.apache.org/jira/browse/ARROW-15643
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: kernel
>
> Triggered by https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure. I thought there was already an issue about this, but I can't find one right away.
> Assume you have a struct array with some fields:
> {code}
> >>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
> >>> arr.type
> StructType(struct<a: int64, b: int64, c: int64>)
> {code}
> We have a kernel to select a single child field:
> {code}
> >>> pc.struct_field(arr, [0])
> <pyarrow.lib.Int64Array object at 0x7ffa9e229940>
> [
> 1,
> 2,
> 3
> ]
> {code}
> But if you want to subset the StructArray to some of its fields, resulting in a new StructArray, that's not possible with {{struct_field}}, and doing this manually is a bit cumbersome:
> {code}
> >>> fields = ['a', 'c']
> >>> arrays = [arr.field(n) for n in fields]
> >>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
> >>> arr_subset.type
> StructType(struct<a: int64, c: int64>)
> {code}
> (this is still OK, but if you had a ChunkedArray, it certainly gets annoying)
> One option could be to expand the existing {{struct_field}} to allow selecting multiple fields (although that probably gets ambiguous/confusing given how you currently select a recursively nested field -> [0, 1] currently means "first child, second subchild" and not "first and second child").
> Or a new kernel like "struct_subset" or some other name.
> This might also overlap with general projection functionality? (cc [~westonpace])
--
This message was sent by Atlassian Jira
(v8.20.1#820001)