Posted to dev@arrow.apache.org by Niklas B <ni...@enplore.com> on 2020/10/01 13:00:55 UTC

Using DNF-like filters on a (py)arrow Table already in memory (or probably: convert pyarrow table to UnionDataset)

Hi,

I have an in-memory dataset from Plasma that I need to filter before running `to_pandas()`. It’s a very text-heavy dataset with a lot of rows and columns (only about 30% of which are applicable for any operation). Now, I know that you can use DNF filters to filter a parquet file before reading it into memory. I’m trying to do the same for my pa.Table that is already in memory. https://issues.apache.org/jira/browse/ARROW-7945 indicates that it should be possible to unify datasets after construction, but my arrow skills aren’t quite there yet.
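
For context, this is the kind of read-time filtering I mean (a rough sketch; the file name and column are invented):

import pyarrow.parquet as pq

# DNF filter applied while reading from disk: keep only rows where b == 1
table = pq.read_table("data.parquet", filters=[("b", "=", 1)])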

> […]
> [data] = plasma_client.get_buffers([object_id], timeout_ms=100)
> buffer = pa.BufferReader(data)
> reader = pa.RecordBatchStreamReader(buffer)
> table = pa.Table.from_batches(reader)


I’ve been reading up on Dataset, UnionDataset and how ParquetDataset does it (http://arrow.apache.org/docs/_modules/pyarrow/parquet.html#ParquetDataset). My thinking is that if I can cast my table to a UnionDataset, I can use the same _filter() code as ParquetDataset does. But I’m not having any luck with my (albeit very naive) approach:

> >>> pyarrow.dataset.UnionDataset(table.schema, [table])

But that just gives me:

> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "pyarrow/_dataset.pyx", line 429, in pyarrow._dataset.UnionDataset.__init__
> TypeError: Cannot convert pyarrow.lib.Table to pyarrow._dataset.Dataset


I’m guessing that somewhere deep in https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_dataset.py there’s an example of how to make a Dataset out of a table, but I’m not finding it.

Any help would be greatly appreciated :)

Regards,
Niklas

Re: Using DNF-like filters on a (py)arrow Table already in memory (or probably: convert pyarrow table to UnionDataset)

Posted by Niklas B <ni...@enplore.com>.
Amazing, just what I need. Thank you so much!

> On 1 Oct 2020, at 15:48, Joris Van den Bossche <jo...@gmail.com> wrote:
> 
> […]


Re: Using DNF-like filters on a (py)arrow Table already in memory (or probably: convert pyarrow table to UnionDataset)

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi Niklas,

In the datasets project there is indeed the notion of an in-memory dataset
(from a RecordBatch or Table); however, constructing such a dataset is
currently not directly exposed in Python (except for writing it).
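
Once that is exposed, I'd expect it to look roughly like this (a sketch of a
possible API, not something that works in Python today):

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({'a': range(5), 'b': [1, 2, 1, 2, 1]})
# hypothetical: wrap the in-memory table in a dataset,
# then filter it with an expression while materializing
dataset = ds.dataset(table)
filtered = dataset.to_table(filter=ds.field('b') == 1)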

But for RecordBatch/Table objects you can also directly filter them with
a boolean mask. A small example with a Table (it works the same with a
RecordBatch):

>>> import pyarrow as pa
>>> import numpy as np
>>> table = pa.table({'a': range(5), 'b': [1, 2, 1, 2, 1]})
>>> table.filter(np.array([True, False, False, False, True]))
pyarrow.Table
a: int64
b: int64
>>> table.filter(np.array([True, False, False, False, True])).to_pandas()
   a  b
0  0  1
1  4  1

Creating the boolean mask based on the table can be done with the
pyarrow.compute module:

>>> import pyarrow.compute as pc
>>> mask = pc.equal(table['b'], pa.scalar(1))
>>> table.filter(mask).to_pandas()
   a  b
0  0  1
1  2  1
2  4  1

So it is still more manual work than just specifying a DNF filter, but normally
all the necessary building blocks are available (the goal is certainly to use
those building blocks in a more general query engine that works for both
in-memory tables and file datasets, eventually, but that doesn't exist yet).
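
For instance, a DNF filter like [[('b', '=', 1)], [('a', '>', 3)]] can be
emulated by combining masks with the boolean kernels (assuming pc.or_ and
pc.greater are available in your pyarrow version):

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({'a': range(5), 'b': [1, 2, 1, 2, 1]})
# (b == 1) OR (a > 3), i.e. one OR across the DNF's AND-groups
mask = pc.or_(pc.equal(table['b'], pa.scalar(1)),
              pc.greater(table['a'], pa.scalar(3)))
filtered = table.filter(mask)  # rows 0, 2 and 4 match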

Best,
Joris

On Thu, 1 Oct 2020 at 15:01, Niklas B <ni...@enplore.com> wrote:

> […]