Posted to dev@arrow.apache.org by Joris Van den Bossche <jo...@gmail.com> on 2021/07/06 06:48:13 UTC

Re: [python] [iter_batches] Is there any value to an iterator based parquet reader in python?

There is a recent JIRA where a row-wise iterator was discussed:
https://issues.apache.org/jira/browse/ARROW-12970.

This should not be too hard to add (although there is a linked JIRA about
improving the performance of the pyarrow -> python objects conversion,
which might require some more engineering work), but of course what's
proposed in the JIRA starts from a materialized record batch (so similar
to the gist here, though I think this is good enough?).
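
For illustration, a minimal sketch of such a row-wise iterator over an
already materialized record batch, using only the existing to_pylist()
conversion (the iter_rows name is hypothetical, just for the example):

import pyarrow as pa

def iter_rows(batch: pa.RecordBatch):
    # convert each column to python objects once, then zip into row tuples
    columns = [col.to_pylist() for col in batch.columns]
    yield from zip(*columns)

batch = pa.RecordBatch.from_pydict({"a": [1, 2, 3], "b": ["x", "y", "z"]})
list(iter_rows(batch))  # [(1, 'x'), (2, 'y'), (3, 'z')]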

On Tue, 6 Jul 2021 at 05:03, Micah Kornfield <em...@gmail.com> wrote:

> I think this type of thing does make sense, at some point people like to
> be able to see their data in rows.
>
> It probably pays to have this conversation on dev@.  Doing this in a
> performant way might take some engineering work, but having a quick
> solution like the one described above might make sense.
>
> -Micah
>
> On Sun, Jun 27, 2021 at 6:23 AM Grant Williams <gr...@grantwilliams.dev>
> wrote:
>
>> Hello,
>>
>> I've found myself wondering if there is a use case for using the
>> iter_batches method in python as an iterator in a similar style to a
>> server-side cursor in Postgres. Right now you can use an iterator of record
>> batches, but I wondered if having some sort of python native iterator might
>> be worth it? Maybe a .to_pyiter() method that converts it to a lazy &
>> batched iterator of native python objects?
>>
>> Here is some example code that shows a similar result.
>>
>> from itertools import chain
>> from typing import Any, Iterator, Tuple
>>
>> def iter_parquet(parquet_file, columns=None, batch_size=1_000) -> Iterator[Tuple[Any, ...]]:
>>     # stream the file as record batches instead of materializing a full Table
>>     record_batches = parquet_file.iter_batches(batch_size=batch_size, columns=columns)
>>
>>     # convert each batch from the columnar format of pyarrow arrays to rows of python objects (yields tuples)
>>     yield from chain.from_iterable(zip(*(col.to_pylist() for col in batch.columns)) for batch in record_batches)
>>
>> (or a gist if you prefer:
>> https://gist.github.com/grantmwilliams/143fd60b3891959a733d0ce5e195f71d)
>>
>> I realize arrow is a columnar format, but I wonder whether buffered
>> row reading as a lazy iterator is a common enough use case, given how
>> common parquet + object storage has become as a database alternative.
>>
>> Thanks,
>> Grant
>>
>> --
>> Grant Williams
>> Machine Learning Engineer
>> https://github.com/grantmwilliams/
>>
>

Re: [python] [iter_batches] Is there any value to an iterator based parquet reader in python?

Posted by Wes McKinney <we...@gmail.com>.
I left a comment in Jira, but I agree that having a faster method to
"box" Arrow array values as Python objects would be useful in a lot of
places. Then these common C++ code paths could be used to "tupleize"
record batches reasonably efficiently.
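
(For context, the pure-Python equivalent of that "tupleize" step is
roughly the sketch below; the per-value boxing in to_pylist() is exactly
the part a common C++ code path could accelerate. This is a sketch, not
an existing API:)

import pyarrow as pa

def tupleize(batch: pa.RecordBatch):
    # box every arrow value as a python object, column by column,
    # then transpose the columns into row tuples
    return list(zip(*(col.to_pylist() for col in batch.columns)))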

Re: [python] [iter_batches] Is there any value to an iterator based parquet reader in python?

Posted by Joris Van den Bossche <jo...@gmail.com>.
Note that the "iter_batches" method on ParquetFile already gives you a
way to consume the Parquet file progressively with a stream of
RecordBatches without creating a single Table for the full Parquet
file (which will already leverage the row groups of the Parquet file).
The example in the JIRA used Table, but there is no reason not to
expose such an iteration method on RecordBatch as well (and I have
updated the title of the JIRA to reflect that).
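
For example, a sketch of that streaming pattern (the file path, columns
and batch size here are placeholders):

import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")
for batch in pf.iter_batches(batch_size=10_000, columns=["a", "b"]):
    # each batch is a pyarrow.RecordBatch covering a slice of the file,
    # so the full file is never materialized as a single Table
    ...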

Re: [python] [iter_batches] Is there any value to an iterator based parquet reader in python?

Posted by Alessandro Molina <al...@ursacomputing.com>.
I guess that doing it at the Parquet reader level might allow the
implementation to better leverage row groups, without the need to keep
the whole Table in memory while you are iterating over the data. The
current JIRA issue, on the other hand, seems to suggest implementing it
for Table once it's already fully available.
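
Something like this sketch of per-row-group reading with the existing
ParquetFile API (the path and the printing are placeholders):

import pyarrow.parquet as pq

pf = pq.ParquetFile("example.parquet")
for i in range(pf.num_row_groups):
    table = pf.read_row_group(i)  # only this row group is held in memory
    for row in zip(*(col.to_pylist() for col in table.columns)):
        print(row)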
