Posted to user@arrow.apache.org by Tim Nicolson <ti...@wayflyer.com> on 2021/11/09 20:45:49 UTC

Project nested field from list of structs

Hi,

I have a Parquet dataset containing "order" structs, each of which has a
list of "item" structs.  I would like to read only a subset of the fields
of the item structs, e.g.

order_id: int64
...other fields...
items: list<item: struct<item_id: int64, price: int64, ...other fields...>>

# is this/will this be possible?
dataset.to_table(columns=["order_id", "items.item_id", "items.price"])

I guess the results would be lists of scalars rather than a list of structs
with fewer fields?

I couldn't see any reference to *lists* in
https://github.com/apache/arrow/pull/11466.

Is this possible or planned?  Is there another way to achieve this?

Thanks in advance,

Tim

Re: Project nested field from list of structs

Posted by Tim Nicolson <ti...@wayflyer.com>.
Super helpful.  I've productionised that - we can strip it out once we can
push it down.

Thanks again David,

Tim


Re: Project nested field from list of structs

Posted by David Li <li...@apache.org>.
Here you go: https://gist.github.com/lidavidm/2375cf34ee57fc694ba90d85025ab894

Pasted inline (let's hope the formatting holds up):

import pyarrow as pa

list_of_struct = pa.array([
    [{"item_id": 0, "price": 100}, {"item_id": 1, "price": 50}],
    [{"item_id": 10, "price": 20}, None],
    None
], type=pa.list_(pa.struct([
    pa.field("item_id", pa.int64()),
    pa.field("price", pa.int64()),
])))

# One array per struct field (this incurs some overhead as it may
# allocate new validity bitmaps)
subarrays = list_of_struct.values.flatten()
# The rest of this is just manipulating array container objects

# Validity bitmap, offsets
buffers = list_of_struct.buffers()[:2]

item_id = pa.ListArray.from_buffers(
    pa.list_(pa.int64()),
    len(list_of_struct),
    buffers,
    list_of_struct.null_count,
    list_of_struct.offset,
    [subarrays[0]])

prices = pa.ListArray.from_buffers(
    pa.list_(pa.int64()),
    len(list_of_struct),
    buffers,
    list_of_struct.null_count,
    list_of_struct.offset,
    [subarrays[1]])

print(item_id)
print(prices)

-David


Re: Project nested field from list of structs

Posted by Tim Nicolson <ti...@wayflyer.com>.
David,

Thanks for the info - glad that this feature is in the pipeline!

I'd really appreciate some pointers on how to efficiently decompose the
ListArray/StructArray - happy to flesh it out and come back with an example
for posterity...

Thanks again,

Tim


Re: Project nested field from list of structs

Posted by David Li <li...@apache.org>.
Hey Tim,

We're still wiring up all the work needed for nested field refs in general (see ARROW-14658 [1]), and we haven't listed out what kinds of references we want to support. I would say we want to support things that Substrait supports [2], and the behavior you describe here appears to correspond to "masked complex expression" references there. That said, the way it ultimately gets implemented/exposed may be different.

For now, you will have to read the column and then postprocess it yourself (this will require you to manually decompose the ListArray/StructArray and reconstruct the ListArray - I can work out an example if that would help).

By the way, thank you for the example here - it reminds me that we also likely should support pushing down the projection so that we only load the necessary leaf nodes in Parquet as well.

[1]: https://issues.apache.org/jira/browse/ARROW-14658
[2]: https://substrait.io/expressions/field_references/#masked-complex-expression

Best,
David
