Posted to user@spark.apache.org by spr <sp...@yarcdata.com> on 2014/11/04 19:41:28 UTC

Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

I am trying to implement a use case that takes some human input.  Putting
that in a single file (as opposed to a collection of HDFS files) would be a
simpler human interface, so I tried an experiment with whether Spark
Streaming (via textFileStream) will recognize a new version of a filename it
has already digested.  (Yes, I'm deleting and moving a new file into the
same name, not modifying in place.)  It appears the answer is No, it does
not recognize a new version.  Can one of the experts confirm a) this is true
and b) this is intended?

Experiment:
- run an existing program that works to digest new files in a directory
- modify the data-creation script to put the new files always under the same
name instead of different names, then run the script

Outcome:  it sees the first file under that name, but none of the subsequent
files (with different contents, which would show up in output).



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-appears-not-to-recognize-a-more-recent-version-of-an-already-seen-file-true-tp18074.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

Posted by spr <sp...@yarcdata.com>.

Holden Karau wrote
> This is the expected behavior. Spark Streaming only reads new files once;
> this is why they must be created through an atomic move, so that Spark
> doesn't accidentally read a partially written file. I'd recommend looking
> at "Basic Sources" in the Spark Streaming guide (
> http://spark.apache.org/docs/latest/streaming-programming-guide.html ).

Thanks for the quick response.

OK, this does seem consistent with the rest of Spark Streaming.  Looking at
Basic Sources, it says
"Once moved [into the directory being observed], the files must not be
changed."  I don't think of removing a file and creating a new one under the
same name as "changing" the file, since the new file has a different inode
number.  It might be more precise to say something like "Once a filename has
been detected by Spark Streaming, it will be viewed as having been processed
for the life of the context."  This edge case also implies that any
filename-generating code must never repeat a name within the life of a
context, which is not easily deduced from the existing description.
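
For what it's worth, here is a minimal sketch of how a data-creation script
might satisfy both constraints discussed in this thread: every file gets a
name that is never reused, and it arrives in the watched directory through
an atomic move. The function name publish_unique, the directory arguments,
and the naming scheme are all hypothetical, and the sketch assumes a local
POSIX filesystem with the staging and watched directories on the same
filesystem (for HDFS the move would go through the HDFS tools instead).

```python
import os
import tempfile
import time
import uuid


def publish_unique(contents, watched_dir, staging_dir, prefix="input"):
    """Write `contents` to a staging file, then move it into `watched_dir`
    under a name that is never reused. Returns the final path.

    Hypothetical helper: the names and the naming scheme are assumptions.
    """
    os.makedirs(watched_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)

    # Timestamp plus random UUID: the name does not repeat within the life
    # of a streaming context (or, in practice, ever).
    unique_name = "{}-{}-{}.txt".format(prefix, time.time_ns(), uuid.uuid4().hex)
    staging_path = os.path.join(staging_dir, unique_name)
    final_path = os.path.join(watched_dir, unique_name)

    # Write the complete file outside the watched directory first, so a
    # reader of the watched directory never sees a partial file.
    with open(staging_path, "w", encoding="utf-8") as f:
        f.write(contents)
        f.flush()
        os.fsync(f.fileno())

    # os.rename is atomic on POSIX when source and destination are on the
    # same filesystem (assumed here).
    os.rename(staging_path, final_path)
    return final_path


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    watched = os.path.join(base, "watched")
    staging = os.path.join(base, "staging")
    print(publish_unique("first version\n", watched, staging))
    print(publish_unique("second version\n", watched, staging))
```

Each call leaves one more complete file in the watched directory, so a
"new version" of the human input shows up as a new filename rather than as
a replacement of an old one.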

Thanks again.




Re: Spark Streaming appears not to recognize a more recent version of an already-seen file; true?

Posted by Holden Karau <ho...@pigscanfly.ca>.

This is the expected behavior. Spark Streaming only reads new files once;
this is why they must be created through an atomic move, so that Spark
doesn't accidentally read a partially written file. I'd recommend looking
at "Basic Sources" in the Spark Streaming guide (
http://spark.apache.org/docs/latest/streaming-programming-guide.html ).


