Posted to user@spark.apache.org by peay <pe...@protonmail.com.INVALID> on 2019/02/14 16:01:19 UTC

Spark lists paths after `write` - how to avoid refreshing the file index?

Hello,

I have a piece of code that looks roughly like this:

df = spark.read.parquet("s3://bucket/data.parquet/name=A", "s3://bucket/data.parquet/name=B")
df_out = df.xxxx     # Do stuff to transform df
df_out.write.partitionBy("name").parquet("s3://bucket/data.parquet")

I specify explicit paths when reading to avoid a very long globbing phase, as there are many such partitions and I am only interested in a few. I observe that whenever Spark does the write, the actual write to S3 happens (through the output committer), but then the Spark driver spends a lot of time listing `s3://bucket/data.parquet/name=A` and `s3://bucket/data.parquet/name=B` (in practice, I have more prefixes, and this leads to parallel listings using Spark tasks, not just on the driver).
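
For concreteness, here is a minimal sketch of what I mean by the read side. I am assuming the basePath option here, which (as far as I understand) is what keeps the `name` partition column available for the partitionBy on the write:

# Read only the partitions of interest, while keeping "name" as a column
# by pointing basePath at the table root (assumption for this sketch).
df = (
    spark.read
    .option("basePath", "s3://bucket/data.parquet")
    .parquet(
        "s3://bucket/data.parquet/name=A",
        "s3://bucket/data.parquet/name=B",
    )
)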

I have seen at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L187 that the FileIndex is refreshed after the write, which I suspect is what triggers this.

However, I am not exactly sure how the whole FileIndex mechanism works: is this the FileIndex from the DataFrame that was read initially? How can I avoid this refresh, which is not useful in my case since I terminate the job right after writing?
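
In case it is useful for reproducing this, here is a self-contained sketch of the same pattern against local paths (hypothetical /tmp locations and a trivial transform; I have not verified whether the local-filesystem version shows the same re-listing, but the structure matches my job):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical local stand-in for the S3 table root.
base = "/tmp/data.parquet"

# Create some partitioned input data to read back.
seed = spark.createDataFrame([("A", 1), ("A", 2), ("B", 3)], ["name", "value"])
seed.write.mode("overwrite").partitionBy("name").parquet(base)

# Read two explicit partition directories, keeping the partition column.
df = (
    spark.read
    .option("basePath", base)
    .parquet(base + "/name=A", base + "/name=B")
)

# Trivial stand-in for the real transformation.
df_out = df.withColumn("value", F.col("value") * 2)

# Write back under the same root; in my real job, this is the point after
# which the driver (and tasks) re-list the input prefixes. Using append
# here since the root already exists.
df_out.write.mode("append").partitionBy("name").parquet(base)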

I've been looking for some documentation on the phenomenon but couldn't find anything.

Thanks for the help!