Posted to issues@spark.apache.org by "Micael Capitão (JIRA)" <ji...@apache.org> on 2014/12/02 17:29:12 UTC

[jira] [Comment Edited] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop

    [ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231697#comment-14231697 ] 

Micael Capitão edited comment on SPARK-3553 at 12/2/14 4:28 PM:
----------------------------------------------------------------

I'm having that same issue running Spark Streaming locally on my Windows machine.
I have something like:
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](cdrsDir, fileFilter(_), newFilesOnly = false)

The "cdrsDir" has initially 2 files in it.

On startup, Spark processes the existing files in "cdrsDir" and stays quiet after that. When I move another file to that dir it detects it and processes it, but after that it processes the first two files again and then the third one, repeating this in an endless loop.
If I add a fourth one, it keeps processing the first two files in the same batch and then processes the 3rd and 4th files in another batch.

If I add more files it keeps repeating, but the behaviour gets weirder. It mixes, for example, the 3rd with the 5th in one batch and the 4th with the 6th in another batch, and stops repeating the first two files.
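For context, a minimal standalone sketch of the setup described above looks roughly like this. The directory name, the filter body, and the object name are placeholders I'm assuming here, not the actual code:

```scala
// Hypothetical reproduction sketch; directory and filter are assumptions,
// not the reporter's real values.
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("fileStream-repro")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val cdrsDir = "file:///tmp/cdrs"  // assumed local directory with 2 files in it
    def fileFilter(path: Path): Boolean = !path.getName.startsWith(".")

    // newFilesOnly = false asks Spark to also pick up files already present
    // at startup; the reported bug is that those files keep being re-processed.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      cdrsDir, fileFilter(_), newFilesOnly = false)

    lines.map(_._2.toString).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this running, dropping new files into the directory should show the re-processing behaviour described above in the batch output.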


was (Author: capitao):
I'm having that same issue running Spark Streaming locally on my Windows machine.
I have something like:
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](cdrsDir, fileFilter(_), newFilesOnly = false)

The "cdrsDir" has initially 2 files in it.

On startup, Spark processes the existing files in "cdrsDir" and stays quiet after that. When I move another file to that dir it detects it and processes it, but after that it processes the first two files again and then the third one in an endless loop.
If I add a fourth one, it keeps processing the first two files in the same batch and then processes the 3rd and 4th files in another batch.

If I add more files it keeps repeating, but the behaviour gets weirder. It mixes, for example, the 3rd with the 5th in one batch and the 4th with the 6th in another batch, and stops repeating the first two files.

> Spark Streaming app streams files that have already been streamed in an endless loop
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-3553
>                 URL: https://issues.apache.org/jira/browse/SPARK-3553
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 1.0.1
>         Environment: Ec2 cluster - YARN
>            Reporter: Ezequiel Bella
>              Labels: S3, Streaming, YARN
>
> We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors, each with 1 core and 588 MB of RAM.
> The app streams from a directory in S3 which is constantly being written; this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)
> The purpose of using fileStream instead of textFileStream is to customize the way that Spark handles existing files when the process starts. We want to process only the new files added after the process launches and skip the existing ones. We configured a batch duration of 10 seconds.
> The process goes fine while we add a small number of files to s3, let's say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But when we try to add a larger number of files, we face a strange behavior; the application starts streaming files that have already been streamed. 
> For example, I add 20 files to S3. The files are processed in 3 batches. The first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these batches endlessly with the same files.
> Any thoughts on what could be causing this?
> Regards,
> Easyb



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org