Posted to user@spark.apache.org by ArtemisDev <ar...@dtechspace.com> on 2020/06/07 17:41:51 UTC

Structured Streaming using File Source - How to handle live files

We were trying to use Structured Streaming with a file source, but had 
problems getting the files read by Spark properly.  We have another 
process that generates the data files in Spark's data source directory 
on a continuous basis.  What we observed was that as soon as a data file 
was created, and before the data-producing process had finished writing 
it, Spark read it immediately without reaching EOF.  Spark then never 
revisits the file, so we ended up with empty data content.  The only way 
to make it work is to produce the data files in a separate directory 
(e.g. /tmp) and move them into Spark's file source directory after the 
data generation completes.
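
For reference, here is a minimal sketch of that produce-then-move pattern.
The directory paths and the helper are illustrative only, and it assumes the
staging directory sits on the same filesystem as the watched directory so
the final move is an atomic rename:

    import os

    STAGING_DIR = "/data/staging"   # hypothetical: same filesystem as the source dir
    SOURCE_DIR = "/data/incoming"   # hypothetical: the directory Spark's file source watches

    def publish_file(filename, records):
        os.makedirs(STAGING_DIR, exist_ok=True)
        staged_path = os.path.join(STAGING_DIR, filename)
        # Write and flush the complete file in the staging directory first.
        with open(staged_path, "w") as f:
            for record in records:
                f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())
        # os.rename is atomic when source and destination are on the same
        # POSIX filesystem, so Spark only ever sees a finished file.
        os.rename(staged_path, os.path.join(SOURCE_DIR, filename))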

My questions:  Is this behavior by design, or is there any way to tell 
the Spark streaming process not to read a file while it is still being 
written by another process?  In other words, do we have to move data 
files through a tmp dir, or can the data-producing process and Spark 
share the same directory?

Thanks!

-- Nick


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Structured Streaming using File Source - How to handle live files

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

Yes, we generally read files from HDFS or object stores like S3, GCS, etc.,
where files cannot be updated once written.

Regards
Gourav

On Sun, 7 Jun 2020, 22:36 Jungtaek Lim, <ka...@gmail.com>
wrote:

> Hi Nick,
>
> I guess that's by design - Spark assumes an input file will not be
> modified once it is placed on the input path. This makes it easy for
> Spark to track which files have been processed and which have not. If
> input files could be modified, Spark would have to enumerate all of the
> files and track how many lines/bytes it has read per file, and in the
> worst case it might read an incomplete line (if the writer doesn't
> guarantee complete writes) and crash or produce incorrect results.
>
> Hope this helps.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Mon, Jun 8, 2020 at 2:43 AM ArtemisDev <ar...@dtechspace.com> wrote:
>
>> We were trying to use Structured Streaming with a file source, but had
>> problems getting the files read by Spark properly.  We have another
>> process that generates the data files in Spark's data source directory
>> on a continuous basis.  What we observed was that as soon as a data file
>> was created, and before the data-producing process had finished writing
>> it, Spark read it immediately without reaching EOF.  Spark then never
>> revisits the file, so we ended up with empty data content.  The only way
>> to make it work is to produce the data files in a separate directory
>> (e.g. /tmp) and move them into Spark's file source directory after the
>> data generation completes.
>>
>> My questions:  Is this behavior by design, or is there any way to tell
>> the Spark streaming process not to read a file while it is still being
>> written by another process?  In other words, do we have to move data
>> files through a tmp dir, or can the data-producing process and Spark
>> share the same directory?
>>
>> Thanks!
>>
>> -- Nick
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>

Re: Structured Streaming using File Source - How to handle live files

Posted by Jungtaek Lim <ka...@gmail.com>.
Hi Nick,

I guess that's by design - Spark assumes an input file will not be
modified once it is placed on the input path. This makes it easy for
Spark to track which files have been processed and which have not. If
input files could be modified, Spark would have to enumerate all of the
files and track how many lines/bytes it has read per file, and in the
worst case it might read an incomplete line (if the writer doesn't
guarantee complete writes) and crash or produce incorrect results.
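
To make that concrete, below is a minimal PySpark sketch of a file-source
stream (the schema, directory, and checkpoint paths are illustrative, not
from this thread). Each file that appears under the input path is recorded
in the checkpoint's file-source metadata and read exactly once, which is
why it has to be complete at the moment Spark first lists it:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType, StringType, StructType

    spark = SparkSession.builder.appName("file-stream-sketch").getOrCreate()

    # Streaming file sources require an explicit schema.
    schema = StructType().add("id", IntegerType()).add("value", StringType())

    stream = (spark.readStream
              .schema(schema)
              .option("maxFilesPerTrigger", 10)  # optional: cap files per micro-batch
              .csv("/data/incoming"))            # the directory the producer moves files into

    query = (stream.writeStream
             .format("console")
             .option("checkpointLocation", "/data/checkpoints/file-stream")
             .start())
    query.awaitTermination()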

Hope this helps.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Mon, Jun 8, 2020 at 2:43 AM ArtemisDev <ar...@dtechspace.com> wrote:

> We were trying to use Structured Streaming with a file source, but had
> problems getting the files read by Spark properly.  We have another
> process that generates the data files in Spark's data source directory
> on a continuous basis.  What we observed was that as soon as a data file
> was created, and before the data-producing process had finished writing
> it, Spark read it immediately without reaching EOF.  Spark then never
> revisits the file, so we ended up with empty data content.  The only way
> to make it work is to produce the data files in a separate directory
> (e.g. /tmp) and move them into Spark's file source directory after the
> data generation completes.
>
> My questions:  Is this behavior by design, or is there any way to tell
> the Spark streaming process not to read a file while it is still being
> written by another process?  In other words, do we have to move data
> files through a tmp dir, or can the data-producing process and Spark
> share the same directory?
>
> Thanks!
>
> -- Nick
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>