
Posted to user@spark.apache.org by Yann Moisan <ya...@gmail.com> on 2018/11/14 20:07:29 UTC

[Spark SQL] [Spark 2.4.0] Performance regression when reading parquet files from S3

Hello,

A Spark job on EMR reads Parquet files located in an S3 bucket.

I use this option: spark.hadoop.fs.s3a.experimental.input.fadvise=random

When the EC2 instances and the bucket are in the same region, performance
is about the same as with the earlier Spark version, but when they are not,
performance drops (job duration is multiplied by 2).

Note: using the default value for the parameter mitigates the issue:

spark.hadoop.fs.s3a.experimental.input.fadvise=sequential
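
For anyone who wants to reproduce the comparison, here is a minimal sketch of
how the two runs could be launched. It only builds the spark-submit command
lines; the application path job.py and the helper name spark_submit_command
are hypothetical placeholders, not taken from my actual job.

```python
import shlex

# The S3A option discussed above. Spark forwards "spark.hadoop.*" keys
# to the Hadoop configuration used by the S3A connector.
FADVISE_KEY = "spark.hadoop.fs.s3a.experimental.input.fadvise"


def spark_submit_command(policy, app="job.py"):
    """Build a spark-submit argument list for one fadvise policy.

    `app` is a hypothetical application path used for illustration only.
    """
    return ["spark-submit", "--conf", f"{FADVISE_KEY}={policy}", app]


if __name__ == "__main__":
    # Print one command per policy so the two runs can be timed side by side.
    for policy in ("random", "sequential"):
        print(shlex.join(spark_submit_command(policy)))
```

Running each printed command against the same input, once from the bucket's
region and once from another region, should show whether the slowdown follows
the fadvise setting.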

Any idea what changed in Spark 2.4.0 that could explain this issue?