Posted to user@spark.apache.org by feng wang <wa...@gmail.com> on 2018/04/11 10:33:53 UTC
Structured Streaming outputs many small files in Append Mode
Hi,
I have read the documentation for Structured Streaming in Spark 2.2:
> Append mode (default) - This is the default mode, where only the new rows
> added to the Result Table since the last trigger will be outputted to the
> sink. This is supported for only those queries where rows added to the
> Result Table is never going to change. Hence, this mode guarantees that
> each row will be output only once (assuming fault-tolerant sink). For
> example, queries with only select, where, map, flatMap, filter, join,
> etc. will support Append mode.
So I tried to write a streaming DataFrame to HDFS with the sample code
below, but I get many small files in the target output path:
>
> val df = spark.readStream
>   .option("sep", ",")
>   .option("header", true)
>   .option("quote", "\"")
>   .csv(inputpath)
>
> // I also tried df.withColumn(xxxx)
> val flow: DataFrame => DataFrame = _.select("name")
> val Data: DataFrame = flow(df)
>
> val query: StreamingQuery = Data.writeStream
>   .format("csv")
>   .option("header", "true")
>   .option("path", output)
>   .option("checkpointLocation", "/tmp/checkout")
>   .outputMode(OutputMode.Append())
>   .start()
>
> query.processAllAvailable()
I found that there were 4 executors shown in the Mesos web UI for the
duration of the job.
My questions are general:
1. Is this a bug in Append mode? I mean, why doesn't it append all records
to one file in append mode?
2. Is there any way to write all records to one file, other than using
`hadoop getmerge` or `Data.coalesce(1).writeStream.xx`? Repartitioning down
to 1 partition just to get a single output file does not seem like a good
solution.
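For reference, a minimal sketch of the coalesce workaround mentioned in
question 2, assuming a `SparkSession` named `spark` and the same
`inputpath`/`output` values as above (the checkpoint path is made up for
illustration; this is a sketch, not a tested pipeline):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}

// Coalesce to a single partition so each micro-batch is written by one
// task, producing one file per trigger instead of one file per partition
// per trigger. This trades write parallelism for fewer output files; it
// still does NOT yield a single file for the whole stream, because the
// file sink must create a new file on every trigger to stay fault-tolerant.
val df: DataFrame = spark.readStream
  .option("sep", ",")
  .option("header", true)
  .option("quote", "\"")
  .csv(inputpath)

val query: StreamingQuery = df
  .select("name")
  .coalesce(1)
  .writeStream
  .format("csv")
  .option("header", "true")
  .option("path", output)
  .option("checkpointLocation", "/tmp/checkout-coalesce") // hypothetical path
  .outputMode(OutputMode.Append())
  .start()

query.processAllAvailable()
```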