Posted to user@arrow.apache.org by Elad Rosenheim <el...@dynamicyield.com> on 2021/05/21 14:05:55 UTC

Filtering list/map arrays

Hi!

One of the gaps I currently have in Funnel Rocket (
https://github.com/DynamicYieldProjects/funnel-rocket) is supporting nested
columns, as in: given a Parquet file with a column of type List(int64), be
able to find rows where the list holds a specific int element.

Right now, the need is fortunately limited to lists of primitives (mostly
int) and maps of string->string, rather than any arbitrary complexity.

Currently, I load Parquet files via pyarrow, then call to_pandas() and run
multiple filters on the DataFrame.

After reading Uwe's blog post (
https://uwekorn.com/2018/08/03/use-numba-to-work-with-apache-arrow-in-pure-python.html)
and looking at the Fletcher project (https://github.com/xhochy/fletcher),
it seems the "proper" way to do it would be:

* Write an ExtensionDType/ExtensionArray that can wrap an Arrow
ChunkedArray made of ListArrays. I'm not even sure what the lookup
operator for a list should be - should list_series == 123 be treated as
"for each list in this series, look for the element 123 in it"?

* Potentially use a @jitclass for more performant lookup, as Uwe has
outlined.

* For now, for any abstract method I'm not sure how to implement -
start by raising an exception, then run some unit tests based on my
project's needs, and see that they pass :-/

* When calling Table.to_pandas(), supply a types_mapper argument to map
the specific supported types to the appropriate extension class.

* If it seems to work, figure out if I've missed something important in the
concrete classes :-/

Am I getting this right, more or less?

Thanks a lot,
Elad

Re: Filtering list/map arrays

Posted by Wes McKinney <we...@gmail.com>.
I think we would want to implement a scalar "list_isin" function as a
core C++ function, so the type signature looks like this:

(Array<List<T>>, Scalar<T>) -> Array<Boolean>

I couldn't find an issue like this with a quick Jira search so I created

https://issues.apache.org/jira/browse/ARROW-12849

On Fri, May 21, 2021 at 8:06 AM Elad Rosenheim <el...@dynamicyield.com> wrote:
> [quoted message above]