Posted to user@arrow.apache.org by Suresh V <su...@gmail.com> on 2022/11/10 21:07:31 UTC

Efficient filtering

Hi, right now I am using something like this:

ArrowScanner.from_batches(pa_table.to_batches(), filter=my_expression).

I was wondering if there is a more efficient way to do this filtering if I
have to exclude some of the rows.

As of now I am changing my expression to something like my_expression &
pc.field('row_id').isin(row_ids).

This filter might actually be doing a lot of extra work to match the "in"
clause for the row ids. Is there some way to direct to_batches to
exclude the rows ahead of time based on a boolean mask?

ArrowScanner.from_batches(pa_table.to_batches(my_mask),
filter=my_expression).


Thanks