You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by patcharee <Pa...@uni.no> on 2016/01/25 15:30:40 UTC

streaming textFileStream problem - got only ONE line

Hi,

My streaming application is receiving data from file system and just 
prints the input count every 1 sec interval, as the code below:

val sparkConf = new SparkConf()
val ssc = new StreamingContext(sparkConf, Milliseconds(interval_ms))
val lines = ssc.textFileStream(args(0))
lines.count().print()

The problem is sometimes the data received from scc.textFileStream is 
ONLY ONE line. But in fact there are multiple lines in the new file 
found in that interval. See log below which shows three intervals. In 
the 2nd interval, the new file is: 
hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt. This file 
contains 6288 lines. The ssc.textFileStream returns ONLY ONE line (the 
header).

Any ideas/suggestions what the problem is?

-----------------------------------------------------------------------------------------
SPARK LOG
-----------------------------------------------------------------------------------------

16/01/25 15:11:11 INFO FileInputDStream: Cleared 1 old files that were 
older than 1453731011000 ms: 1453731010000 ms
16/01/25 15:11:11 INFO FileInputDStream: Cleared 0 old files that were 
older than 1453731011000 ms:
16/01/25 15:11:12 INFO FileInputDStream: Finding new files took 4 ms
16/01/25 15:11:12 INFO FileInputDStream: New files at time 1453731072000 ms:
hdfs://helmhdfs/user/patcharee/cerdata/datetime_19616.txt
-------------------------------------------
Time: 1453731072000 ms
-------------------------------------------
6288

16/01/25 15:11:12 INFO FileInputDStream: Cleared 1 old files that were 
older than 1453731012000 ms: 1453731011000 ms
16/01/25 15:11:12 INFO FileInputDStream: Cleared 0 old files that were 
older than 1453731012000 ms:
16/01/25 15:11:13 INFO FileInputDStream: Finding new files took 4 ms
16/01/25 15:11:13 INFO FileInputDStream: New files at time 1453731073000 ms:
hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt
-------------------------------------------
Time: 1453731073000 ms
-------------------------------------------
1

16/01/25 15:11:13 INFO FileInputDStream: Cleared 1 old files that were 
older than 1453731013000 ms: 1453731012000 ms
16/01/25 15:11:13 INFO FileInputDStream: Cleared 0 old files that were 
older than 1453731013000 ms:
16/01/25 15:11:14 INFO FileInputDStream: Finding new files took 3 ms
16/01/25 15:11:14 INFO FileInputDStream: New files at time 1453731074000 ms:
hdfs://helmhdfs/user/patcharee/cerdata/datetime_19618.txt
-------------------------------------------
Time: 1453731074000 ms
-------------------------------------------
6288


Thanks,
Patcharee

Re: streaming textFileStream problem - got only ONE line

Posted by patcharee <Pa...@uni.no>.
I moved them every interval to the monitored directory.

Patcharee

On 25. jan. 2016 22:30, Shixiong(Ryan) Zhu wrote:
> Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", 
> or write into it directly? `textFileStream` requires that files must 
> be written to the monitored directory by "moving" them from another 
> location within the same file system.
>
> On Mon, Jan 25, 2016 at 6:30 AM, patcharee <Patcharee.Thongtra@uni.no 
> <ma...@uni.no>> wrote:
>
>     Hi,
>
>     My streaming application is receiving data from file system and
>     just prints the input count every 1 sec interval, as the code below:
>
>     val sparkConf = new SparkConf()
>     val ssc = new StreamingContext(sparkConf, Milliseconds(interval_ms))
>     val lines = ssc.textFileStream(args(0))
>     lines.count().print()
>
>     The problem is sometimes the data received from scc.textFileStream
>     is ONLY ONE line. But in fact there are multiple lines in the new
>     file found in that interval. See log below which shows three
>     intervals. In the 2nd interval, the new file is:
>     hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt. This
>     file contains 6288 lines. The ssc.textFileStream returns ONLY ONE
>     line (the header).
>
>     Any ideas/suggestions what the problem is?
>
>     -----------------------------------------------------------------------------------------
>     SPARK LOG
>     -----------------------------------------------------------------------------------------
>
>     16/01/25 15:11:11 INFO FileInputDStream: Cleared 1 old files that
>     were older than 1453731011000 ms: 1453731010000 ms
>     16/01/25 15:11:11 INFO FileInputDStream: Cleared 0 old files that
>     were older than 1453731011000 ms:
>     16/01/25 15:11:12 INFO FileInputDStream: Finding new files took 4 ms
>     16/01/25 15:11:12 INFO FileInputDStream: New files at time
>     1453731072000 ms:
>     hdfs://helmhdfs/user/patcharee/cerdata/datetime_19616.txt
>     -------------------------------------------
>     Time: 1453731072000 ms
>     -------------------------------------------
>     6288
>
>     16/01/25 15:11:12 INFO FileInputDStream: Cleared 1 old files that
>     were older than 1453731012000 ms: 1453731011000 ms
>     16/01/25 15:11:12 INFO FileInputDStream: Cleared 0 old files that
>     were older than 1453731012000 ms:
>     16/01/25 15:11:13 INFO FileInputDStream: Finding new files took 4 ms
>     16/01/25 15:11:13 INFO FileInputDStream: New files at time
>     1453731073000 ms:
>     hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt
>     -------------------------------------------
>     Time: 1453731073000 ms
>     -------------------------------------------
>     1
>
>     16/01/25 15:11:13 INFO FileInputDStream: Cleared 1 old files that
>     were older than 1453731013000 ms: 1453731012000 ms
>     16/01/25 15:11:13 INFO FileInputDStream: Cleared 0 old files that
>     were older than 1453731013000 ms:
>     16/01/25 15:11:14 INFO FileInputDStream: Finding new files took 3 ms
>     16/01/25 15:11:14 INFO FileInputDStream: New files at time
>     1453731074000 ms:
>     hdfs://helmhdfs/user/patcharee/cerdata/datetime_19618.txt
>     -------------------------------------------
>     Time: 1453731074000 ms
>     -------------------------------------------
>     6288
>
>
>     Thanks,
>     Patcharee
>
>


Re: streaming textFileStream problem - got only ONE line

Posted by Saisai Shao <sa...@gmail.com>.
Any possibility that this file is still written by other application, so
what Spark Streaming processed is an incomplete file.

On Tue, Jan 26, 2016 at 5:30 AM, Shixiong(Ryan) Zhu <shixiong@databricks.com
> wrote:

> Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or
> write into it directly? `textFileStream` requires that files must be
> written to the monitored directory by "moving" them from another location
> within the same file system.
>
> On Mon, Jan 25, 2016 at 6:30 AM, patcharee <Pa...@uni.no>
> wrote:
>
>> Hi,
>>
>> My streaming application is receiving data from file system and just
>> prints the input count every 1 sec interval, as the code below:
>>
>> val sparkConf = new SparkConf()
>> val ssc = new StreamingContext(sparkConf, Milliseconds(interval_ms))
>> val lines = ssc.textFileStream(args(0))
>> lines.count().print()
>>
>> The problem is sometimes the data received from scc.textFileStream is
>> ONLY ONE line. But in fact there are multiple lines in the new file found
>> in that interval. See log below which shows three intervals. In the 2nd
>> interval, the new file is:
>> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt. This file
>> contains 6288 lines. The ssc.textFileStream returns ONLY ONE line (the
>> header).
>>
>> Any ideas/suggestions what the problem is?
>>
>>
>> -----------------------------------------------------------------------------------------
>> SPARK LOG
>>
>> -----------------------------------------------------------------------------------------
>>
>> 16/01/25 15:11:11 INFO FileInputDStream: Cleared 1 old files that were
>> older than 1453731011000 ms: 1453731010000 ms
>> 16/01/25 15:11:11 INFO FileInputDStream: Cleared 0 old files that were
>> older than 1453731011000 ms:
>> 16/01/25 15:11:12 INFO FileInputDStream: Finding new files took 4 ms
>> 16/01/25 15:11:12 INFO FileInputDStream: New files at time 1453731072000
>> ms:
>> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19616.txt
>> -------------------------------------------
>> Time: 1453731072000 ms
>> -------------------------------------------
>> 6288
>>
>> 16/01/25 15:11:12 INFO FileInputDStream: Cleared 1 old files that were
>> older than 1453731012000 ms: 1453731011000 ms
>> 16/01/25 15:11:12 INFO FileInputDStream: Cleared 0 old files that were
>> older than 1453731012000 ms:
>> 16/01/25 15:11:13 INFO FileInputDStream: Finding new files took 4 ms
>> 16/01/25 15:11:13 INFO FileInputDStream: New files at time 1453731073000
>> ms:
>> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt
>> -------------------------------------------
>> Time: 1453731073000 ms
>> -------------------------------------------
>> 1
>>
>> 16/01/25 15:11:13 INFO FileInputDStream: Cleared 1 old files that were
>> older than 1453731013000 ms: 1453731012000 ms
>> 16/01/25 15:11:13 INFO FileInputDStream: Cleared 0 old files that were
>> older than 1453731013000 ms:
>> 16/01/25 15:11:14 INFO FileInputDStream: Finding new files took 3 ms
>> 16/01/25 15:11:14 INFO FileInputDStream: New files at time 1453731074000
>> ms:
>> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19618.txt
>> -------------------------------------------
>> Time: 1453731074000 ms
>> -------------------------------------------
>> 6288
>>
>>
>> Thanks,
>> Patcharee
>>
>
>

Re: streaming textFileStream problem - got only ONE line

Posted by "Shixiong(Ryan) Zhu" <sh...@databricks.com>.
Did you move the file into "hdfs://helmhdfs/user/patcharee/cerdata/", or
write into it directly? `textFileStream` requires that files must be
written to the monitored directory by "moving" them from another location
within the same file system.

On Mon, Jan 25, 2016 at 6:30 AM, patcharee <Pa...@uni.no>
wrote:

> Hi,
>
> My streaming application is receiving data from file system and just
> prints the input count every 1 sec interval, as the code below:
>
> val sparkConf = new SparkConf()
> val ssc = new StreamingContext(sparkConf, Milliseconds(interval_ms))
> val lines = ssc.textFileStream(args(0))
> lines.count().print()
>
> The problem is sometimes the data received from scc.textFileStream is ONLY
> ONE line. But in fact there are multiple lines in the new file found in
> that interval. See log below which shows three intervals. In the 2nd
> interval, the new file is:
> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt. This file
> contains 6288 lines. The ssc.textFileStream returns ONLY ONE line (the
> header).
>
> Any ideas/suggestions what the problem is?
>
>
> -----------------------------------------------------------------------------------------
> SPARK LOG
>
> -----------------------------------------------------------------------------------------
>
> 16/01/25 15:11:11 INFO FileInputDStream: Cleared 1 old files that were
> older than 1453731011000 ms: 1453731010000 ms
> 16/01/25 15:11:11 INFO FileInputDStream: Cleared 0 old files that were
> older than 1453731011000 ms:
> 16/01/25 15:11:12 INFO FileInputDStream: Finding new files took 4 ms
> 16/01/25 15:11:12 INFO FileInputDStream: New files at time 1453731072000
> ms:
> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19616.txt
> -------------------------------------------
> Time: 1453731072000 ms
> -------------------------------------------
> 6288
>
> 16/01/25 15:11:12 INFO FileInputDStream: Cleared 1 old files that were
> older than 1453731012000 ms: 1453731011000 ms
> 16/01/25 15:11:12 INFO FileInputDStream: Cleared 0 old files that were
> older than 1453731012000 ms:
> 16/01/25 15:11:13 INFO FileInputDStream: Finding new files took 4 ms
> 16/01/25 15:11:13 INFO FileInputDStream: New files at time 1453731073000
> ms:
> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19617.txt
> -------------------------------------------
> Time: 1453731073000 ms
> -------------------------------------------
> 1
>
> 16/01/25 15:11:13 INFO FileInputDStream: Cleared 1 old files that were
> older than 1453731013000 ms: 1453731012000 ms
> 16/01/25 15:11:13 INFO FileInputDStream: Cleared 0 old files that were
> older than 1453731013000 ms:
> 16/01/25 15:11:14 INFO FileInputDStream: Finding new files took 3 ms
> 16/01/25 15:11:14 INFO FileInputDStream: New files at time 1453731074000
> ms:
> hdfs://helmhdfs/user/patcharee/cerdata/datetime_19618.txt
> -------------------------------------------
> Time: 1453731074000 ms
> -------------------------------------------
> 6288
>
>
> Thanks,
> Patcharee
>