Posted to user@spark.apache.org by Lian Jiang <ji...@gmail.com> on 2019/03/12 02:52:21 UTC
Read JSON and write into Parquet in executors
Hi,
In my Spark batch job:
step 1: the driver assigns a partition of the JSON file path list to each
executor.
step 2: each executor fetches its assigned JSON files from S3 and saves
them into HDFS.
step 3: the driver reads these JSON files into a DataFrame and saves it as
Parquet.
To improve performance by avoiding the intermediate JSON writes to HDFS, I
want to change the workflow to:
step 1: the driver assigns a partition of the JSON file path list to each
executor.
step 2: each executor fetches its assigned JSON files from S3, merges the
JSON content in memory, and writes Parquet directly. No need to write the
JSON files to HDFS.
I cannot create DataFrames on executors. Is this improvement feasible?
Appreciate any help!