Posted to users@nifi.apache.org by Vibhath Ileperuma <vi...@gmail.com> on 2021/03/19 14:30:46 UTC

Writing parquet files to S3

Hi all,

I'm developing a NiFi flow to convert a set of CSV data to Parquet format
and upload the resulting files to an S3 bucket. I use a 'ConvertRecord'
processor with a CSV reader and a Parquet record set writer to convert the
data, and a 'PutS3Object' processor to send the files to the S3 bucket.

When converting, I need to make sure the Parquet row group size is 256 MB
and that each Parquet file contains only one row group. Even though it is
possible to set the row group size in ParquetRecordSetWriter, I couldn't
find a way to ensure each Parquet file contains only one row group (if a
CSV file contains more data than fits in a 256 MB row group, multiple
Parquet files should be generated).

I would be grateful if you could suggest a way to do this.

Thanks & Regards

*Vibhath Ileperuma*

Re: Writing parquet files to S3

Posted by Joe Witt <jo...@gmail.com>.
Not responding to the real question in the thread, but regarding "I'm using
NiFi 1.13.1": please switch to 1.13.2 right away due to a regression in
1.13.1.


On Mon, Mar 22, 2021 at 12:24 AM Vibhath Ileperuma
<vi...@gmail.com> wrote:
>
> Hi Bryan,
>
> I'm planning to add these generated parquet files to an impala S3 table.
> I noticed that impala written parquet files contain only one row group. That's why I'm trying to write one row group per file.
>
> However, I tried to create small parquet files (Snappy compressed) first and use a MergeRecord Processor with a ParquetRecordSetWriter in which the row group size is set to 256 MB to generate parquet files with one row group. The configurations I used,
>
> Merge Strategy: Bin-Packing Algorithm
> Minimum Number of Records: 1
> Maximum Number of Records: 2500000 (2.5 million)
> Minimum Bin Size: 230 MB
> Maximum Bin Size: 256 MB
> Max Bin Age: 20 minutes
>
> Note that the above-mentioned small Parquet files usually contain 200,000 records each and are about 21-22 MB in size. Hence, about 12 files should be merged to generate one file.
>
> But when I run the processor, it always merges 19 files and generates files of 415-417 MB.
>
> I'm using NiFi 1.13.1. Could you please let me know how to resolve this issue?
>
> Thanks & Regards
>
> Vibhath Ileperuma
>
>
>
>
>
> On Fri, Mar 19, 2021 at 8:45 PM Bryan Bende <bb...@gmail.com> wrote:
>>
>> Hello,
>>
>> What would the reason be to need only one row group per file? Parquet
>> files by design can have many row groups.
>>
>> The ParquetRecordSetWriter won't be able to do this since it is just
>> given an output stream to write all the records to, which happens to
>> be the output stream for one flow file.
>>
>> -Bryan
>>
>> On Fri, Mar 19, 2021 at 10:31 AM Vibhath Ileperuma
>> <vi...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I'm developing a NiFi flow to convert a set of CSV data to Parquet format and upload the resulting files to an S3 bucket. I use a 'ConvertRecord' processor with a CSV reader and a Parquet record set writer to convert the data, and a 'PutS3Object' processor to send the files to the S3 bucket.
>> >
>> > When converting, I need to make sure the Parquet row group size is 256 MB and that each Parquet file contains only one row group. Even though it is possible to set the row group size in ParquetRecordSetWriter, I couldn't find a way to ensure each Parquet file contains only one row group (if a CSV file contains more data than fits in a 256 MB row group, multiple Parquet files should be generated).
>> >
>> > I would be grateful if you could suggest a way to do this.
>> >
>> > Thanks & Regards
>> >
>> > Vibhath Ileperuma
>> >
>> >
>> >

Re: Writing parquet files to S3

Posted by Vibhath Ileperuma <vi...@gmail.com>.
Hi Bryan,

I'm planning to add these generated Parquet files to an Impala S3 table.
I noticed that Parquet files written by Impala contain only one row group.
That's why I'm trying to write one row group per file.

However, I tried creating small Parquet files (Snappy compressed) first
and using a MergeRecord processor with a ParquetRecordSetWriter in which
the row group size is set to 256 MB, to generate Parquet files with one
row group. The configurations I used:

   1. Merge Strategy: Bin-Packing Algorithm
   2. Minimum Number of Records: 1
   3. Maximum Number of Records: 2500000 (2.5 million)
   4. Minimum Bin Size: 230 MB
   5. Maximum Bin Size: 256 MB
   6. Max Bin Age: 20 minutes

Note that the above-mentioned small Parquet files usually contain 200,000
records each and are about 21-22 MB in size. Hence, about 12 files should
be merged to generate one file.

But when I run the processor, it always merges 19 files and generates
files of 415-417 MB.
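
For anyone hitting the same thing, a small standalone snippet along these
lines (plain parquet-mr; the file path is passed as a program argument) can
show how many row groups a merged file actually contains and how large each
one is. Note that getTotalByteSize() reports the uncompressed size of a row
group, while the files on disk are Snappy compressed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class PrintRowGroups {
    public static void main(String[] args) throws Exception {
        Path path = new Path(args[0]); // path to one of the merged .parquet files
        try (ParquetFileReader reader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))) {
            // Each "block" in the footer is one row group.
            System.out.println("Row groups: " + reader.getFooter().getBlocks().size());
            reader.getFooter().getBlocks().forEach(block ->
                System.out.println("  rows=" + block.getRowCount()
                    + " uncompressedBytes=" + block.getTotalByteSize()
                    + " compressedBytes=" + block.getCompressedSize()));
        }
    }
}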

I'm using NiFi 1.13.1. Could you please let me know how to resolve this
issue?

Thanks & Regards

*Vibhath Ileperuma*





On Fri, Mar 19, 2021 at 8:45 PM Bryan Bende <bb...@gmail.com> wrote:

> Hello,
>
> What would the reason be to need only one row group per file? Parquet
> files by design can have many row groups.
>
> The ParquetRecordSetWriter won't be able to do this since it is just
> given an output stream to write all the records to, which happens to
> be the output stream for one flow file.
>
> -Bryan
>
> On Fri, Mar 19, 2021 at 10:31 AM Vibhath Ileperuma
> <vi...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I'm developing a NiFi flow to convert a set of CSV data to Parquet
> > format and upload the resulting files to an S3 bucket. I use a
> > 'ConvertRecord' processor with a CSV reader and a Parquet record set
> > writer to convert the data, and a 'PutS3Object' processor to send the
> > files to the S3 bucket.
> >
> > When converting, I need to make sure the Parquet row group size is
> > 256 MB and that each Parquet file contains only one row group. Even
> > though it is possible to set the row group size in
> > ParquetRecordSetWriter, I couldn't find a way to ensure each Parquet
> > file contains only one row group (if a CSV file contains more data
> > than fits in a 256 MB row group, multiple Parquet files should be
> > generated).
> >
> > I would be grateful if you could suggest a way to do this.
> >
> > Thanks & Regards
> >
> > Vibhath Ileperuma
> >
> >
> >
>

Re: Writing parquet files to S3

Posted by Bryan Bende <bb...@gmail.com>.
Hello,

What would the reason be to need only one row group per file? Parquet
files by design can have many row groups.

The ParquetRecordSetWriter won't be able to do this since it is just
given an output stream to write all the records to, which happens to
be the output stream for one flow file.
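
To make that concrete, any splitting has to happen before the writer ever
sees the records. A rough standalone sketch (plain parquet-avro, nothing
NiFi-specific; 'records' and 'rowsPerFile' are placeholders) of writing one
file, and therefore one row group, per chunk, assuming each chunk stays
under the 256 MB row group size:

import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class OneRowGroupPerFileSketch {

    // 'records' stands in for the rows of one large CSV; 'rowsPerFile' would be
    // tuned so that one chunk is roughly one 256 MB row group.
    static void writeChunks(Schema schema, List<GenericRecord> records, int rowsPerFile) throws IOException {
        int fileIndex = 0;
        for (int start = 0; start < records.size(); start += rowsPerFile) {
            List<GenericRecord> chunk =
                records.subList(start, Math.min(start + rowsPerFile, records.size()));
            try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("part-" + (fileIndex++) + ".parquet"))
                         .withSchema(schema)
                         .withCompressionCodec(CompressionCodecName.SNAPPY)
                         .withRowGroupSize(256 * 1024 * 1024)
                         .build()) {
                for (GenericRecord record : chunk) {
                    writer.write(record); // each chunk gets its own writer, its own file, and one row group
                }
            }
        }
    }
}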

-Bryan

On Fri, Mar 19, 2021 at 10:31 AM Vibhath Ileperuma
<vi...@gmail.com> wrote:
>
> Hi all,
>
> I'm developing a NiFi flow to convert a set of CSV data to Parquet format and upload the resulting files to an S3 bucket. I use a 'ConvertRecord' processor with a CSV reader and a Parquet record set writer to convert the data, and a 'PutS3Object' processor to send the files to the S3 bucket.
>
> When converting, I need to make sure the Parquet row group size is 256 MB and that each Parquet file contains only one row group. Even though it is possible to set the row group size in ParquetRecordSetWriter, I couldn't find a way to ensure each Parquet file contains only one row group (if a CSV file contains more data than fits in a 256 MB row group, multiple Parquet files should be generated).
>
> I would be grateful if you could suggest a way to do this.
>
> Thanks & Regards
>
> Vibhath Ileperuma
>
>
>