Posted to user@impala.apache.org by Vibhath Ileperuma <vi...@gmail.com> on 2021/03/22 05:54:00 UTC

Adding parquet files with multiple row groups into a S3 table

Hi all,

I noticed that Parquet files written by Impala contain only one row group.
I'm using Apache NiFi to generate a set of Parquet files, and each of those
files might contain more than one row group. I would like to know how Impala
will be affected if I add these files to an Impala S3 table (by adding a new
partition).
Further, I would like to know how the pages are arranged within a row group
in an Impala-written Parquet file.

Thanks & Regards

*Vibhath Ileperuma*

Re: Adding parquet files with multiple row groups into a S3 table

Posted by Zoltán Borók-Nagy <bo...@cloudera.com>.
Hi Vibhath,

Please make sure that your Impala version is not affected by IMPALA-10310
<https://issues.apache.org/jira/browse/IMPALA-10310> (Impala 3.3 and 3.4
have this bug).
If your version has the bug, the workaround is to set
PARQUET_OBJECT_STORE_SPLIT_SIZE / fs.s3a.block.size to the row group size
used by your writer.
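
As a rough sketch of that workaround, the snippet below builds the SET
statement for a given writer row group size. It assumes the query option
takes a value in bytes and that the writer uses 128 MiB row groups; both are
assumptions to check against your own setup, and split_size_statement is a
hypothetical helper, not part of Impala.

```python
def split_size_statement(row_group_mib: int) -> str:
    """Build an Impala SET statement for a row group size given in MiB.

    Assumption: PARQUET_OBJECT_STORE_SPLIT_SIZE is expressed in bytes.
    """
    if row_group_mib <= 0:
        raise ValueError("row group size must be positive")
    size_bytes = row_group_mib * 1024 * 1024
    return f"SET PARQUET_OBJECT_STORE_SPLIT_SIZE={size_bytes};"


# Assumed example: the writer produces 128 MiB row groups.
print(split_size_statement(128))
```

The printed statement would be run in the Impala session before querying the
table; the same byte value would apply if you set fs.s3a.block.size instead.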

Cheers,
    Zoltan


On Mon, Mar 22, 2021 at 7:01 AM Tim Armstrong <ti...@gmail.com>
wrote:

> Impala can read files with multiple row groups fine - many other engines
> generate files like that and it comes up all the time.
>
> I believe the column chunks end up being written in the order of the table
> schema, but maybe someone else knows for sure.
>
> Impala targets a 64 KB page size.

Re: Adding parquet files with multiple row groups into a S3 table

Posted by Tim Armstrong <ti...@gmail.com>.
Impala can read files with multiple row groups fine - many other engines
generate files like that and it comes up all the time.

I believe the column chunks end up being written in the order of the table
schema, but maybe someone else knows for sure.

Impala targets a 64 KB page size.
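
To get a feel for what that target means, here is a back-of-the-envelope
sketch that estimates how many data pages a column chunk would hold. It
assumes pages are filled close to the target size and ignores dictionary
pages and compression, so treat the result as a rough estimate only;
estimated_data_pages is a hypothetical helper, not an Impala API.

```python
import math

# Target page size mentioned above (64 KB).
PAGE_TARGET_BYTES = 64 * 1024


def estimated_data_pages(column_chunk_bytes: int,
                         page_target_bytes: int = PAGE_TARGET_BYTES) -> int:
    """Rough count of data pages in one column chunk of the given size."""
    if column_chunk_bytes < 0 or page_target_bytes <= 0:
        raise ValueError("sizes must be non-negative and target positive")
    # Every non-empty chunk needs at least one page; round up otherwise.
    return math.ceil(column_chunk_bytes / page_target_bytes)


# Assumed example: an 8 MiB column chunk.
print(estimated_data_pages(8 * 1024 * 1024))
```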