Posted to user@spark.apache.org by "Mendelson, Assaf" <As...@rsa.com> on 2017/06/13 13:43:11 UTC

having trouble using structured streaming with file sink (parquet)

Hi all,

I have recently started assessing Structured Streaming and ran into a little snag right at the beginning.

Basically I wanted to read some data, do some basic aggregation and write the result to a file:

import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.streaming.ProcessingTime

// Read a stream of parquet files and compute the average of "id" per group "g".
val rawRecords = spark.readStream.schema(myschema).parquet("/mytest")
val q = rawRecords.withColumn("g", $"id" % 100).groupBy("g").agg(avg($"id"))

// Write the aggregated result back out as parquet every 10 seconds.
val res = q.writeStream.outputMode("complete").trigger(ProcessingTime("10 seconds"))
  .format("parquet").option("path", "/test2")
  .option("checkpointLocation", "/mycheckpoint").start()

The problem is that Spark tells me the parquet sink does not support Complete mode (or Update, for that matter); the file sink only accepts Append.
So how would I do a streaming aggregation that writes its result to files?
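
One direction I can think of is to stay in Append mode (the only mode the file sink accepts) by adding an event-time watermark and windowing the aggregation. This is only a rough sketch: it assumes the records carry an event-time column (called "ts" here), and it changes the semantics to per-window averages instead of a single running average:

import org.apache.spark.sql.functions.{avg, window}

// Assumption: records have an event-time column "ts". With a watermark, a
// windowed aggregation can be emitted in Append mode, which the file sink supports.
val windowed = rawRecords
  .withWatermark("ts", "10 minutes")
  .withColumn("g", $"id" % 100)
  .groupBy(window($"ts", "10 minutes"), $"g")
  .agg(avg($"id"))

val appendRes = windowed.writeStream
  .outputMode("append")
  .trigger(ProcessingTime("10 seconds"))
  .format("parquet")
  .option("path", "/test2")
  .option("checkpointLocation", "/mycheckpoint")
  .start()
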
More generally, my goal is to have one (slow) streaming process that writes some profile, and a second streaming process that loads the current profile DataFrame and uses it in a join; I would stop the second process periodically and reload the DataFrame. A rough sketch of what I have in mind follows.
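
This is only a sketch, and the paths, the helper name and the join key "g" are made up: the second job reads the latest profile snapshot as a static DataFrame, does a stream-static join, and gets restarted whenever I want it to pick up a fresh snapshot.

import org.apache.spark.sql.DataFrame

// Sketch: re-create the streaming query so the join picks up a fresh profile snapshot.
def startEnrichedQuery() = {
  // Static snapshot of the profile written by the first (slow) streaming job.
  val profile: DataFrame = spark.read.parquet("/test2")
  val events = spark.readStream.schema(myschema).parquet("/mytest")

  // Stream-static inner join on the derived key column "g".
  events.withColumn("g", $"id" % 100)
    .join(profile, Seq("g"))
    .writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "/joined")
    .option("checkpointLocation", "/mycheckpoint_join")
    .start()
}

// Periodically: stop the returned query, then call startEnrichedQuery() again.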

Any help would be appreciated.

Thanks,
              Assaf.