Posted to user@spark.apache.org by Matt Kuiper <ma...@polarisalpha.com> on 2019/03/27 15:45:15 UTC

Parquet File Output Sink - Spark Structured Streaming

Hello,

I am new to Spark and Structured Streaming and have the following File Output Sink question:

I am wondering what triggers a Spark Structured Streaming query (with a Parquet file output sink configured) to write data to the Parquet files, and how to modify that behavior. I periodically feed the stream input data (using readStream to read in files), but it does not write output to a Parquet file for each file provided as input. Once I have given it a few files, it tends to write a Parquet file just fine.

I am wondering how to control the threshold at which it writes. I would like to be able to force a new write to a Parquet file for every new file provided as input (at least for initial testing). Any tips appreciated!
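
For reference, here is a simplified sketch of what I am running (the schema, paths, input format, and trigger interval are placeholders, not my actual values):

    import org.apache.spark.sql.streaming.Trigger

    // Stream new files from an input directory (file sources require an explicit schema)
    val input = spark.readStream
      .schema(inputSchema)
      .json("/input/dir")

    // Write each micro-batch out as Parquet; a checkpoint location is required for file sinks
    val query = input.writeStream
      .format("parquet")
      .option("path", "/output/dir")
      .option("checkpointLocation", "/checkpoint/dir")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()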

Thanks,
Matt


Re: Parquet File Output Sink - Spark Structured Streaming

Posted by Matt Kuiper <ma...@polarisalpha.com>.
Thanks Gabor - your comment helps me clarify my question.

Yes, I have maxFilesPerTrigger set to 1 on the readStream call. I can see the streaming query process the single input file; however, a single input file does not appear to result in the query writing the output to a Parquet file.
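
For clarity, the read side of my query looks roughly like this (the schema, path, and input format are placeholders):

    // At most one new file should be picked up per micro-batch
    val input = spark.readStream
      .schema(inputSchema)
      .option("maxFilesPerTrigger", "1")
      .json("/input/dir")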


Matt



________________________________
From: Gabor Somogyi <ga...@gmail.com>
Sent: Wednesday, March 27, 2019 10:20:18 AM
To: Matt Kuiper
Cc: user@spark.apache.org
Subject: Re: Parquet File Output Sink - Spark Structured Streaming

Hi Matt,

Maybe you could set maxFilesPerTrigger to 1.

BR,
G


On Wed, Mar 27, 2019 at 4:45 PM Matt Kuiper <ma...@polarisalpha.com> wrote:

Hello,

I am new to Spark and Structured Streaming and have the following File Output Sink question:

I am wondering what triggers a Spark Structured Streaming query (with a Parquet file output sink configured) to write data to the Parquet files, and how to modify that behavior. I periodically feed the stream input data (using readStream to read in files), but it does not write output to a Parquet file for each file provided as input. Once I have given it a few files, it tends to write a Parquet file just fine.

I am wondering how to control the threshold at which it writes. I would like to be able to force a new write to a Parquet file for every new file provided as input (at least for initial testing). Any tips appreciated!

Thanks,
Matt


Re: Parquet File Output Sink - Spark Structured Streaming

Posted by Gabor Somogyi <ga...@gmail.com>.
Hi Matt,

Maybe you could set maxFilesPerTrigger to 1.
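
Something along these lines (assuming a file source; the schema and path are placeholders):

    spark.readStream
      .schema(mySchema)
      .option("maxFilesPerTrigger", "1")  // each micro-batch picks up at most one new file
      .json("/path/to/input")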

BR,
G


On Wed, Mar 27, 2019 at 4:45 PM Matt Kuiper <ma...@polarisalpha.com>
wrote:

> Hello,
>
> I am new to Spark and Structured Streaming and have the following File
> Output Sink question:
>
> I am wondering what triggers a Spark Structured Streaming query (with a
> Parquet file output sink configured) to write data to the Parquet files,
> and how to modify that behavior. I periodically feed the stream input
> data (using readStream to read in files), but it does not write output to
> a Parquet file for each file provided as input. Once I have given it a
> few files, it tends to write a Parquet file just fine.
>
> I am wondering how to control the threshold at which it writes. I would
> like to be able to force a new write to a Parquet file for every new file
> provided as input (at least for initial testing). Any tips appreciated!
>
> Thanks,
> Matt
>
>