Posted to user@spark.apache.org by Tomas Bartalos <to...@gmail.com> on 2019/09/19 16:10:06 UTC

Parquet read performance for different schemas

Hello,

I have 2 parquets (each containing 1 file):

   - parquet-wide - schema has 25 top level cols + 1 array
   - parquet-narrow - schema has 3 top level cols

Both files have the same data for the given columns.
When I read from parquet-wide, Spark reports *read 52.6 KB*; from
parquet-narrow, *only 2.6 KB*.
For a bigger dataset the difference is *413 MB vs 961 MB*. Needless to say,
reading the narrow parquet is much faster.

Since schema pruning is applied, I *expected to get similar results* for
both scenarios (timing and amount of data read).
What do you think is the reason for such a big difference, and is there any
tuning I can do?
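
For reference, a minimal sketch of the comparison I'm running in spark-shell
(the paths and the "amount" column name are placeholders for my actual setup):

import spark.implicits._                      // enables the 'amount symbol syntax
import org.apache.spark.sql.functions.sum

val wide   = spark.read.parquet("/data/parquet-wide")    // 25 top-level cols + 1 array
val narrow = spark.read.parquet("/data/parquet-narrow")  // 3 top-level cols

// Identical single-column aggregation against both files; only the
// "size of files read" metric reported by Spark differs.
wide.select(sum('amount)).show()
narrow.select(sum('amount)).show()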

Thank you,
Tomas

Re: Parquet read performance for different schemas

Posted by Julien Laurenceau <ju...@pepitedata.com>.
Hi Tomas,

Parquet tuning time !!!
I strongly recommend reading the CERN posts on Spark and Parquet tuning:
https://db-blog.web.cern.ch/blog/luca-canali/2017-06-diving-spark-and-parquet-workloads-example


You have to check the size of the row groups in your parquet files and
maybe tweak it a little.
If I remember correctly, if Parquet detects too much cardinality in a
row group, predicate push-down will not be enabled and you'll be forced to
read the full row group even if you only need a single row.
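
If the row groups turn out to be very large, one knob you can try when
rewriting the files from Spark is the Parquet block (row group) size.
A rough sketch; the 64 MB value and the paths are only placeholders:

// Parquet sizes its row groups via the parquet.block.size Hadoop property;
// smaller row groups mean less data scanned per hit, at the cost of more metadata.
val df = spark.read.parquet("/data/parquet-wide")                 // placeholder path
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
df.write.parquet("/data/parquet-wide-rewritten")                  // placeholder path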

Check the schema of your parquet files with parquet-tools (you don't need
Spark for this) and do some tuning of your Spark writes:
hadoop jar /.../parquet-tools-<VERSION>.jar <command> my_parquet_file.parquet
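
For example, the schema and meta commands print the file schema and the
per-row-group / per-column-chunk sizes, which is what you want to compare
between the two files (keeping the placeholder path from above):

hadoop jar /.../parquet-tools-<VERSION>.jar schema my_parquet_file.parquet
hadoop jar /.../parquet-tools-<VERSION>.jar meta my_parquet_file.parquet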

You may also have a look at your hadoopConfiguration, in particular at:
spark.sql.parquet.mergeSchema
parquet.enable.summary-metadata
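
For example (the "false" values here are only illustrative; the point is
where each setting lives, Spark SQL conf vs. Hadoop conf):

// spark.sql.parquet.mergeSchema is a Spark SQL conf (also settable as a read option)
spark.conf.set("spark.sql.parquet.mergeSchema", "false")
// parquet.enable.summary-metadata is a Hadoop/Parquet property
spark.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")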

Regards,
Julien


On Fri, 20 Sep 2019 at 15:37, Tomas Bartalos <to...@gmail.com> wrote:

> I forgot to mention an important part: I'm issuing the same query to both
> parquets, selecting only one column:
>
> df.select(sum('amount))
>
> BR,
> Tomas
>
> On Thu, 19 Sep 2019 at 18:10, Tomas Bartalos <to...@gmail.com> wrote:
>
>> Hello,
>>
>> I have 2 parquets (each containing 1 file):
>>
>>    - parquet-wide - schema has 25 top level cols + 1 array
>>    - parquet-narrow - schema has 3 top level cols
>>
>> Both files have the same data for the given columns.
>> When I read from parquet-wide, Spark reports *read 52.6 KB*; from
>> parquet-narrow, *only 2.6 KB*.
>> For a bigger dataset the difference is *413 MB vs 961 MB*. Needless to say,
>> reading the narrow parquet is much faster.
>>
>> Since schema pruning is applied, I *expected to get similar results* for
>> both scenarios (timing and amount of data read).
>> What do you think is the reason for such a big difference, and is there any
>> tuning I can do?
>>
>> Thank you,
>> Tomas
>>
>

Re: Parquet read performance for different schemas

Posted by Tomas Bartalos <to...@gmail.com>.
I forgot to mention an important part: I'm issuing the same query to both
parquets, selecting only one column:

df.select(sum('amount))

BR,
Tomas

On Thu, 19 Sep 2019 at 18:10, Tomas Bartalos <to...@gmail.com> wrote:

> Hello,
>
> I have 2 parquets (each containing 1 file):
>
>    - parquet-wide - schema has 25 top level cols + 1 array
>    - parquet-narrow - schema has 3 top level cols
>
> Both files have the same data for the given columns.
> When I read from parquet-wide, Spark reports *read 52.6 KB*; from
> parquet-narrow, *only 2.6 KB*.
> For a bigger dataset the difference is *413 MB vs 961 MB*. Needless to say,
> reading the narrow parquet is much faster.
>
> Since schema pruning is applied, I *expected to get similar results* for
> both scenarios (timing and amount of data read).
> What do you think is the reason for such a big difference, and is there any
> tuning I can do?
>
> Thank you,
> Tomas
>
