Posted to user@spark.apache.org by feng wang <wa...@gmail.com> on 2018/04/11 10:33:53 UTC
Structured Streaming outputs many small files in Append Mode
Hi,
I have read the documentation for Structured Streaming in Spark 2.2:
> Append mode (default) - This is the default mode, where only the new rows
> added to the Result Table since the last trigger will be outputted to the
> sink. This is supported for only those queries where rows added to the
> Result Table is never going to change. Hence, this mode guarantees that
> each row will be output only once (assuming fault-tolerant sink). For
> example, queries with only select, where, map, flatMap, filter, join,
> etc. will support Append mode.
So I tried to write a streaming DataFrame to HDFS with the sample code
below, but I get many small files in the target output path:
>
> val df = spark.readStream
>   .option("sep", ",")
>   .option("header", true)
>   .option("quote", "\"")
>   .csv(inputpath)
>
> // I also tried df.withColumn(xxxx)
> val flow: DataFrame => DataFrame = _.select("name")
> val Data: DataFrame = flow(df)
>
> val query: StreamingQuery = Data.writeStream
>   .format("csv")
>   .option("header", "true")
>   .option("path", output)
>   .option("checkpointLocation", "/tmp/checkout")
>   .outputMode(OutputMode.Append())
>   .start()
>
> query.processAllAvailable()
I found that there were 4 executors shown in the Mesos web UI for the
duration of the job.
My questions are general:
1. Is this a bug in Append mode? I mean, why doesn't it append all records
to one file in append mode?
2. Is there any way to write all records to one file, other than using
`hadoop getmerge` or `Data.coalesce(1).writeStream.xx`? Repartitioning down
to 1 partition just to get a single output file does not seem like a good
solution.
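For reference, a minimal sketch of the coalesce workaround mentioned in
question 2, assuming a `SparkSession` named `spark` and the same
`inputpath`/`output` values as above (the checkpoint path is made up for
illustration; this is a sketch, not a tested pipeline):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.{OutputMode, StreamingQuery}

// Coalesce to a single partition so each micro-batch is written by one
// task, producing one file per trigger instead of one file per partition
// per trigger. This trades write parallelism for fewer output files; it
// still does NOT yield a single file for the whole stream, because the
// file sink must create a new file on every trigger to stay fault-tolerant.
val df: DataFrame = spark.readStream
  .option("sep", ",")
  .option("header", true)
  .option("quote", "\"")
  .csv(inputpath)

val query: StreamingQuery = df
  .select("name")
  .coalesce(1)
  .writeStream
  .format("csv")
  .option("header", "true")
  .option("path", output)
  .option("checkpointLocation", "/tmp/checkout-coalesce") // hypothetical path
  .outputMode(OutputMode.Append())
  .start()

query.processAllAvailable()
```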