Posted to dev@parquet.apache.org by Mania Abdi <ma...@gmail.com> on 2020/09/04 23:18:09 UTC
Spark + Parquet, parquet dictionary
Hello everyone,
I have two questions about the Parquet file format:
1. Where is the Parquet dictionary stored in a Parquet file? Is it stored
in the footer of the file, or is it stored in each page?
2. When Spark reads a Parquet file, how is an RDD partitioned to read it?
Does it allocate one RDD partition per Parquet file, per page, per page
group, or per block?
I would appreciate it if anyone can help me with these questions.
Regards
Mania
Re: Spark + Parquet, parquet dictionary
Posted by Micah Kornfield <em...@gmail.com>.
>
> 1. Where is the parquet dictionary stored in ParquetFile? Is it stored
> in the Footer of the file? Or is it stored in each page?
It is stored in its own page, the dictionary page, which is written once
per column chunk ahead of that chunk's data pages [1].
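To make the layout concrete, here is a minimal sketch in plain Python. It is a toy model, not the real Parquet binary format: the ColumnChunk class, the encode_column_chunk and decode_column_chunk helpers, and the page_size parameter are all hypothetical names chosen for illustration. It only shows the idea that the distinct values live in one dictionary page and the data pages hold indices into it.

```python
from dataclasses import dataclass


@dataclass
class ColumnChunk:
    """Toy model of one column chunk (hypothetical, not the on-disk format)."""
    dictionary_page: list  # distinct values, stored once, ahead of the data pages
    data_pages: list       # each data page is a list of indices into dictionary_page


def encode_column_chunk(values, page_size=4):
    """Dictionary-encode values and cut the indices into fixed-size data pages."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    pages = [indices[i:i + page_size] for i in range(0, len(indices), page_size)]
    return ColumnChunk(dictionary_page=dictionary, data_pages=pages)


def decode_column_chunk(chunk):
    """Rebuild the original values by looking each index up in the dictionary page."""
    return [chunk.dictionary_page[i] for page in chunk.data_pages for i in page]


values = ["us", "de", "us", "us", "fr", "de", "us"]
chunk = encode_column_chunk(values)
print(chunk.dictionary_page)  # the single dictionary page
print(chunk.data_pages)       # data pages carry only indices
```

Every data page in the chunk shares the one dictionary page, so neither the footer nor the individual data pages need to carry the dictionary values.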
> 2. When Spark reads a Parquet File, how is an RDD partitioned to read a
> ParquetFile? Does it allocate one RDD partition per Parquet File? Or per
> page? or per Page group? or per Block?
This is better asked on the Spark mailing list, but I would guess it
depends, and it probably can't be partitioned more finely than a
RowGroup.
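As a rough illustration of what "no finer than a RowGroup" could look like, here is a sketch in plain Python. It assumes, and this thread does not confirm it, that the file is cut into fixed-size byte ranges and that each row group is read by the range containing its midpoint. The function assign_row_groups and its parameters are hypothetical names, not a Spark or Parquet API.

```python
def assign_row_groups(row_groups, file_size, split_size):
    """Map each byte-range split to the row groups it would read.

    row_groups is a list of (start_offset, total_bytes) tuples.
    Assumption for this sketch: a row group belongs to the split
    that contains its midpoint, so it is never divided.
    """
    splits = [
        (start, min(start + split_size, file_size))
        for start in range(0, file_size, split_size)
    ]
    assignment = {split: [] for split in splits}
    for index, (start, size) in enumerate(row_groups):
        midpoint = start + size // 2
        for low, high in splits:
            if low <= midpoint < high:
                assignment[(low, high)].append(index)
                break
    return assignment


# Three row groups in a 300-byte file, cut into 128-byte splits.
row_groups = [(4, 100), (104, 100), (204, 90)]
assignment = assign_row_groups(row_groups, file_size=300, split_size=128)
for split, groups in assignment.items():
    print(split, groups)
```

Under this assumption a split can end up with several row groups or with none, but a single row group is always read whole by one task.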
[1] https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8