Posted to user@spark.apache.org by Justin Pihony <ju...@gmail.com> on 2015/03/14 21:18:24 UTC

Bug in Streaming files?

All,
    Looking into this StackOverflow question
<https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469>,
it appears that there is a bug when using the newFilesOnly parameter in
FileInputDStream. Before creating a ticket, I wanted to verify it here. The
gist is that this code is wrong:

val modTimeIgnoreThreshold = math.max(
  initialModTimeIgnoreThreshold,                 // initial threshold based on newFilesOnly setting
  currentTime - durationToRemember.milliseconds  // trailing end of the remember window
)

The problem is that if you set newFilesOnly to false, then
initialModTimeIgnoreThreshold is always 0, so it always drops out of the
max operation. So, at best you get files that were put in the directory
within the remember duration before the start.
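
To make the arithmetic concrete, here is a minimal standalone sketch of the
threshold computation described above. It is not the Spark source: the object
name ThresholdSketch, the helper initialThreshold, and the parameters
startTime and rememberMillis are hypothetical, and it assumes (as described
above) that the initial threshold is 0 when newFilesOnly is false and the
stream's start time otherwise.

```scala
// Standalone illustration only; names are hypothetical, not Spark's API.
object ThresholdSketch {
  // Assumption: the initial threshold is the stream start time when
  // newFilesOnly is true, and 0 otherwise.
  def initialThreshold(newFilesOnly: Boolean, startTime: Long): Long =
    if (newFilesOnly) startTime else 0L

  // Mirrors the math.max expression quoted in this message.
  def modTimeIgnoreThreshold(
      newFilesOnly: Boolean,
      startTime: Long,
      currentTime: Long,
      rememberMillis: Long): Long =
    math.max(
      initialThreshold(newFilesOnly, startTime),
      currentTime - rememberMillis // trailing end of the remember window
    )
}
```

With startTime = currentTime = 1000000 and rememberMillis = 60000, the
threshold is 940000 for newFilesOnly = false and 1000000 for
newFilesOnly = true, so a file last modified at 500000 falls below the
threshold either way.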

Is this a bug or expected behavior? It seems like a bug to me.

If I am correct, this appears to be a bigger fix than just using min as it
would break other functionality.
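
One possible reading of why min is not a drop-in fix, as a standalone sketch.
This is an illustration, not Spark code: the object name MinVariantSketch,
the function thresholdWithMin, and its parameters are hypothetical, and it
assumes the initial threshold is the stream's start time when newFilesOnly is
true and 0 otherwise.

```scala
// Standalone illustration only; names are hypothetical, not Spark's API.
object MinVariantSketch {
  // Same inputs as the max-based expression, but combined with math.min.
  def thresholdWithMin(
      newFilesOnly: Boolean,
      startTime: Long,
      currentTime: Long,
      rememberMillis: Long): Long = {
    // Assumption: start time when newFilesOnly is true, 0 otherwise.
    val initial = if (newFilesOnly) startTime else 0L
    math.min(initial, currentTime - rememberMillis)
  }
}
```

With startTime = currentTime = 1000000 and rememberMillis = 60000,
newFilesOnly = true yields 940000, which is below the start time, so the
threshold would no longer hold at the start time. With newFilesOnly = false
the result is 0 for any positive window end, so the remember window would
never raise the threshold.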



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Bug-in-Streaming-files-tp22051.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Bug in Streaming files?

Posted by Sean Owen <so...@cloudera.com>.
No, I don't think that much is a bug, since newFilesOnly=false removes
a constraint that otherwise exists, and that's what you see.

However, read the closely related issue:
https://issues.apache.org/jira/browse/SPARK-6061

@tdas, there is an open question for you there.
