Posted to dev@spark.apache.org by Yash Sharma <ya...@gmail.com> on 2017/04/27 10:49:50 UTC

Spark parquet reading behaves differently with the number of paths

Hi Fellow Devs,
I have noticed that the Spark parquet reader behaves very differently over the
same data set in two scenarios:
1. passing a single parent path to the data, vs.
2. passing all the files individually to parquet(paths: String*)
(both call styles are sketched below)
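For concreteness, a minimal sketch of the two call styles. It assumes an
existing SparkSession `spark`, plus the `file_schema` and `list_of_paths`
names from the snippet further down; the S3 path is illustrative:

# Assumes: spark (SparkSession), file_schema, list_of_paths as in the snippet below.

# Option 1: one parent path; Spark discovers the files under it itself.
df_parent = spark.read.schema(file_schema).parquet('s3://path/to/data/')

# Option 2: every file path passed explicitly; Python unpacks the list
# into the varargs signature parquet(paths: String*).
df_files = spark.read.schema(file_schema).parquet(*list_of_paths)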

The data set has about ~50K files. With the first option the job handles
all the data and completes in a few hours; however, for a use case
where a subset of paths has to be passed, the job is stuck for a few
hours and then dies. It never starts executing anything, and it looks like
the driver is running some sort of 'file path exists' check sequentially
over every path before starting the job.
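For the subset use case, one workaround I am considering (untested, and an
assumption on my part) is to read the parent path once and prune with a
filter on a partition column, so Spark can prune at planning time rather
than checking each of the ~50K paths up front. A minimal sketch, assuming
a hypothetical partition column 'dt':

# Untested idea: read the parent path once and prune with a partition
# predicate instead of handing Spark ~50K explicit file paths.
# 'dt' is a hypothetical partition column, used only for illustration.
events = spark.read \
    .schema(file_schema) \
    .parquet('s3://path/to/data/') \
    .where("dt >= '2017-04-01' AND dt < '2017-05-01'")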

Has anyone stumbled upon this issue?

Appreciate any pointers.

Snippet:

# file_schema is the explicit schema for the parquet files;
# list_of_paths holds the ~50K individual file paths.
events = spark.read \
    .schema(file_schema) \
    .option("basePath", 's3://path/to/data/') \
    .parquet(*list_of_paths)  # unpacked into parquet(paths: String*)