Posted to dev@beam.apache.org by Robert Bradshaw <ro...@google.com> on 2018/11/13 12:43:45 UTC

Re: [PROPOSAL] ParquetIO support for Python SDK

Was there resolution on how to handle row group size, given that it's
hard to pick a decent default? IIRC, the ideal was to base this on
byte sizes; will this be in v1 or will there be other parameter(s)
that we'll have to support going forward?
On Tue, Oct 30, 2018 at 10:42 PM Heejong Lee <he...@google.com> wrote:
>
> Thanks all for the valuable feedback on the document. Here's the summary of planned features for ParquetIO Python SDK:
>
> Can read Parquet files from any storage system supported by Beam
>
> Can write Parquet files to any storage system supported by Beam
>
> Can configure the compression algorithm of output files
>
> Can adjust the size of the row group
>
> Can read multiple row groups in a single file in parallel (source splitting)
>
> Can read a subset of columns (column projection)
>
>
> It introduces a new dependency, pyarrow, for Parquet reading and writing operations.
>
> If you're interested, you can review and test the PR https://github.com/apache/beam/pull/6763
>
> Thanks,
>
> On Wed, Oct 24, 2018 at 5:37 PM Chamikara Jayalath <ch...@google.com> wrote:
>>
>> Thanks Heejong. Added some comments. +1 for summarizing the doc in the email thread.
>>
>> - Cham
>>
>> On Wed, Oct 24, 2018 at 4:45 PM Ahmet Altay <al...@google.com> wrote:
>>>
>>> Thank you Heejong. Could you also share a summary of the design document (major points/decisions) in the mailing list?
>>>
>>> On Wed, Oct 24, 2018 at 4:08 PM, Heejong Lee <he...@google.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm working on BEAM-4444: Parquet IO for Python SDK.
>>>>
>>>> Issue: https://issues.apache.org/jira/browse/BEAM-4444
>>>> Design doc: https://docs.google.com/document/d/1-FT6zmjYhYFWXL8aDM5mNeiUnZdKnnB021zTo4S-0Wg
>>>> WIP PR: https://github.com/apache/beam/pull/6763
>>>>
>>>> Any feedback is appreciated. Thanks!
>>>>
>>>

Re: [PROPOSAL] ParquetIO support for Python SDK

Posted by Heejong Lee <he...@google.com>.
In the current PR, two parameters control the final row group size:
row_group_buffer_size and record_batch_size. Records are first stored as a
list of columns and then transformed into a record batch (a data structure
defined in pyarrow) when the number of records in the list reaches
record_batch_size. Record batches accumulate in another list, which is
written out as a single row group when the byte size of that list exceeds
row_group_buffer_size. The in-memory size of this buffer is normally much
larger than the encoded size of the corresponding row group in the Parquet
file, so row_group_buffer_size is not an exact estimate of the on-disk row
group size, but I think this is the best we can do given the limitations of
the available Python Parquet libraries. An accurate byte-size estimate would
require the Parquet library to support buffered writing of a row group
together with a method returning the size of the encoded data in the write
buffer; no currently available Python Parquet library implements these
features.
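The two-level buffering described above can be sketched in plain Python. This is only a simplified illustration: the parameter names row_group_buffer_size and record_batch_size come from the PR, but the RowGroupBuffer class, its byte-size accounting via sys.getsizeof, and the row_groups output list are hypothetical stand-ins for the pyarrow-based record batches and file writer in the actual implementation.

```python
import sys


class RowGroupBuffer:
    """Illustrative sketch of two-level buffering for Parquet row groups.

    Records accumulate until there are record_batch_size of them, at which
    point they are frozen into a batch (pyarrow would build a RecordBatch
    here). Frozen batches accumulate in turn, and once their estimated byte
    size exceeds row_group_buffer_size they are flushed as one row group.
    """

    def __init__(self, row_group_buffer_size, record_batch_size):
        self.row_group_buffer_size = row_group_buffer_size
        self.record_batch_size = record_batch_size
        self._current = []       # records not yet frozen into a batch
        self._batches = []       # frozen batches awaiting a row-group flush
        self._buffered_bytes = 0
        self.row_groups = []     # flushed row groups (stand-in for file output)

    def write(self, record):
        self._current.append(record)
        if len(self._current) >= self.record_batch_size:
            self._freeze_batch()
        if self._buffered_bytes >= self.row_group_buffer_size:
            self._flush_row_group()

    def _freeze_batch(self):
        batch = list(self._current)
        self._current = []
        self._batches.append(batch)
        # Crude in-memory size estimate; the real implementation would ask
        # pyarrow for the RecordBatch's buffer sizes, which is still larger
        # than the encoded on-disk size of the eventual row group.
        self._buffered_bytes += sum(sys.getsizeof(r) for r in batch)

    def _flush_row_group(self):
        if self._batches:
            self.row_groups.append(self._batches)
            self._batches = []
            self._buffered_bytes = 0

    def close(self):
        # Flush any partial batch and remaining buffered batches on close.
        if self._current:
            self._freeze_batch()
        self._flush_row_group()
```

For example, writing ten records with record_batch_size=4 and a tiny row_group_buffer_size freezes batches of 4, 4, and (on close) 2 records, each flushed as its own row group; a realistic buffer size would instead let several batches accumulate per row group.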

