Posted to user@spark.apache.org by Lian Jiang <ji...@gmail.com> on 2019/03/12 02:52:21 UTC

Read JSON and write into Parquet in executors

Hi,

In my Spark batch job:

step 1: the driver assigns a partition of the JSON file path list to each
executor.
step 2: each executor fetches its assigned JSON files from S3 and saves
them into HDFS.
step 3: the driver reads these JSON files into a DataFrame and saves it
as Parquet (sketched below).
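
For context, here is a minimal sketch of the current flow in Scala,
assuming an existing `spark` session; `listS3JsonPaths` and `s3ToHdfs` are
hypothetical helpers for listing the S3 objects and copying one object to
HDFS, and all paths are illustrative:

    // Steps 1-2: distribute the S3 path list and stage each file into HDFS.
    val jsonPaths: Seq[String] = listS3JsonPaths()
    spark.sparkContext
      .parallelize(jsonPaths, numSlices = 32)   // roughly one slice per executor task
      .foreachPartition(paths => paths.foreach(s3ToHdfs))

    // Step 3: back on the driver, a distributed read of the staged JSON,
    // written out as Parquet.
    val df = spark.read.json("hdfs:///staging/json/")
    df.write.mode("overwrite").parquet("hdfs:///data/out")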

To improve performance by avoiding the intermediate JSON write to HDFS, I
want to change the workflow to:

step 1: the driver assigns a partition of the JSON file path list to each
executor.
step 2: each executor fetches its assigned JSON files from S3, merges the
JSON content in memory, and writes Parquet directly, so the JSON never
touches HDFS (see the sketch after this list).
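
Roughly what I picture, since DataFrames are not available inside
executors, is dropping down to parquet-mr's AvroParquetWriter in a
foreachPartition. This is only a sketch under assumptions: it reuses
`jsonPaths` from the sketch above, and `avroSchemaJson`, `fetchJsonLines`,
and `jsonToRecord` are hypothetical placeholders for my schema and the
S3/JSON handling:

    import org.apache.avro.Schema
    import org.apache.avro.generic.GenericRecord
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.avro.AvroParquetWriter
    import org.apache.spark.TaskContext

    // Steps 1-2 combined: each executor pulls its JSON files from S3 and
    // streams the records straight into one Parquet file per partition.
    spark.sparkContext
      .parallelize(jsonPaths, numSlices = 32)
      .foreachPartition { paths =>
        val schema = new Schema.Parser().parse(avroSchemaJson)
        val partId = TaskContext.getPartitionId()
        val writer = AvroParquetWriter
          .builder[GenericRecord](new Path(s"hdfs:///data/out/part-$partId.parquet"))
          .withSchema(schema)
          .build()
        try {
          for (p <- paths; line <- fetchJsonLines(p))
            writer.write(jsonToRecord(line, schema))   // JSON record -> Avro -> Parquet
        } finally {
          writer.close()
        }
      }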

I cannot create DataFrames in executors, though. Is this improvement
feasible? Appreciate any help!