Posted to user@arrow.apache.org by "Lee, David (PAG)" <Da...@blackrock.com> on 2023/05/04 20:56:43 UTC

Cartesian Product Function Help

I'm trying to construct a Cartesian product as a pyarrow table using pyarrow compute, but I haven't found an elegant way to do this.

Any suggestions?

For inputs:

import pyarrow as pa

n_legs = pa.array([2, 4, 5, 100])
animals = pa.array(["Flamingo", "Horse", "Brittle stars", "Centipede"])
names = ["n_legs", "animals"]

Desired output: 16 rows, which is the product of the 4 elements in n_legs and the 4 elements in animals.

>>> final_table
pyarrow.Table
n_legs: int64
animals: string
----
n_legs: [[2,4,5,100,100,...,4,4,5,100,2]]
animals: [["Flamingo","Horse","Brittle stars","Centipede","Flamingo",...,"Centipede","Flamingo","Horse","Brittle stars","Centipede"]]

>>> final_table.to_pylist()
[{'n_legs': 2, 'animals': 'Flamingo'}, {'n_legs': 4, 'animals': 'Horse'}, {'n_legs': 5, 'animals': 'Brittle stars'}, {'n_legs': 100, 'animals': 'Centipede'}, {'n_legs': 100, 'animals': 'Flamingo'}, {'n_legs': 2, 'animals': 'Horse'}, {'n_legs': 4, 'animals': 'Brittle stars'}, {'n_legs': 5, 'animals': 'Centipede'}, {'n_legs': 100, 'animals': 'Flamingo'}, {'n_legs': 5, 'animals': 'Horse'}, {'n_legs': 2, 'animals': 'Brittle stars'}, {'n_legs': 4, 'animals': 'Centipede'}, {'n_legs': 4, 'animals': 'Flamingo'}, {'n_legs': 5, 'animals': 'Horse'}, {'n_legs': 100, 'animals': 'Brittle stars'}, {'n_legs': 2, 'animals': 'Centipede'}]

The above example is a Cartesian join between two arrays, but this could potentially be a product join between 3, 4, or 5 arrays, which may also have different lengths.




Re: Cartesian Product Function Help

Posted by Aldrin <oc...@pm.me>.
Yeah, I don't see anything that jumps out at me as a clear solution. I'm also not sure that you would want the result materialized unless it was reasonably small.

I don't see that Acero has implemented crossrel [1], which would be exactly what you'd want.



My suggestion is essentially to build it yourself, potentially as a compute function. The only other general recommendations I can give would be to use dictionary encoding for string columns, maybe some clever use of run-length encoding, and/or to use generators since you're in Python.


Definitely a lackluster answer, but if you would like more direction, sharing your requirements would be really useful.




[1]: https://substrait.io/relations/logical_relations/#cross-product-operation

