Posted to dev@parquet.apache.org by Mania Abdi <ma...@gmail.com> on 2020/09/04 23:18:09 UTC

Spark + Parquet, parquet dictionary

Hello everyone,

I have two questions about the Parquet file format:
1. Where is the Parquet dictionary stored in a Parquet file? Is it stored
in the footer of the file, or is it stored in each page?
2. When Spark reads a Parquet file, how is the RDD partitioned to read it?
Does it allocate one RDD partition per Parquet file, per page, per page
group, or per block?

I would appreciate it if anyone can help me with these questions.

Regards
Mania

Re: Spark + Parquet, parquet dictionary

Posted by Micah Kornfield <em...@gmail.com>.
> 1. Where is the Parquet dictionary stored in a Parquet file? Is it stored
> in the footer of the file, or is it stored in each page?


It is stored in its own page, a dictionary page, rather than in the footer
or in every data page [1].
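
To make the layout concrete, here is a toy sketch of dictionary encoding for
a single column chunk: one dictionary page holding the distinct values, then
data pages that hold only indices into it. This is an illustration in plain
Python, not the real on-disk format. The function names and the page_size
parameter are made up for the example, and real Parquet stores the indices
with an RLE/bit-packed hybrid encoding rather than as Python lists.

```python
# Toy model of a dictionary-encoded column chunk (illustrative names only;
# this does not read or write real Parquet bytes).

def encode_column_chunk(values, page_size=4):
    """Return (dictionary_page, data_pages) for a list of values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in the dictionary
    indices = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])
    # Data pages carry only integer indices, cut into fixed-size pages.
    data_pages = [indices[i:i + page_size]
                  for i in range(0, len(indices), page_size)]
    return dictionary, data_pages


def decode_column_chunk(dictionary, data_pages):
    """Rebuild the values by looking each index up in the dictionary page."""
    return [dictionary[i] for page in data_pages for i in page]


values = ["US", "DE", "US", "US", "FR", "DE", "US", "FR", "US"]
dictionary_page, data_pages = encode_column_chunk(values)

# Round trip: the dictionary page plus the data pages recover the column.
assert decode_column_chunk(dictionary_page, data_pages) == values
```

The point of the sketch is that a reader needs the dictionary page before it
can decode any data page of that column chunk, which is why the dictionary
lives with the column chunk and not in the file footer.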

> 2. When Spark reads a Parquet file, how is the RDD partitioned to read it?
> Does it allocate one RDD partition per Parquet file, per page, per page
> group, or per block?


This is better asked on the Spark mailing list, but I would guess that it
depends, and that a read probably can't be partitioned more finely than a
row group.
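
As a rough illustration of that guess, the sketch below assigns whole row
groups to byte-range splits of a file, so that no row group is ever divided
between two partitions. The midpoint rule used here (a row group belongs to
the split that contains its middle byte) is an assumption for the example,
as are the offsets, sizes, and the function name; check the Spark and
parquet-mr sources before relying on any of it.

```python
# Hypothetical sketch: map whole row groups onto byte-range splits.
# Assumption: a row group is read by the split containing its midpoint.

def row_groups_for_split(row_groups, split_start, split_end):
    """row_groups is a list of (start_offset, byte_length) tuples.

    Returns the indices of the row groups whose midpoint falls in
    the half-open range [split_start, split_end).
    """
    picked = []
    for i, (start, length) in enumerate(row_groups):
        midpoint = start + length // 2
        if split_start <= midpoint < split_end:
            picked.append(i)
    return picked


# Made-up file: three 100-byte row groups after a 4-byte header.
row_groups = [(4, 100), (104, 100), (204, 100)]
# Made-up splits of 128 bytes over a 320-byte file.
splits = [(0, 128), (128, 256), (256, 320)]

assignment = [row_groups_for_split(row_groups, s, e) for s, e in splits]
print(assignment)
```

Under these assumptions every row group lands in exactly one split, and a
split can end up with no row groups at all, so the number of partitions
that do real work is bounded by the number of row groups.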


[1]
https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8
