Posted to user@spark.apache.org by Luis Ángel Vicente Sánchez <la...@gmail.com> on 2014/07/07 16:11:45 UTC

Possible bug in Spark Streaming :: TextFileStream

I have a basic Spark Streaming job that is watching a folder, processing
any new file and updating a column family in Cassandra using the new
cassandra-spark-driver.
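
For reference, this is roughly the shape of the job (a minimal sketch; the
folder path, batch interval and parsing are placeholders, and the actual
Cassandra write via the cassandra-spark-driver is left as a comment):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FolderWatchJob {
  def main(args: Array[String]): Unit = {
    // Local mode, as in my tests; folder path and batch interval are made up.
    val conf = new SparkConf().setMaster("local[2]").setAppName("folder-watch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Pick up files that appear in the monitored directory.
    val lines = ssc.textFileStream("/data/incoming")

    lines.foreachRDD { rdd =>
      // Parse each line and update the Cassandra column family here,
      // e.g. with the cassandra-spark-driver's save call (omitted).
      rdd.foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}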

I think there is a problem with SparkStreamingContext.textFileStream... if
I start my job in local mode with no files in the watched folder and then
copy in a bunch of files, sometimes Spark keeps processing those files
again and again.

I have noticed that it usually happens when Spark doesn't detect all the
new files in one go... e.g. I copied 6 files and Spark detected 3 of them
as new and processed them; then it detected the other 3 as new and
processed them. After it finished processing all 6 files, it detected the
first 3 files as new again and processed them... then the other 3... and
again... and again... and again.

Should I raise a JIRA issue?

Regards,

Luis

Re: Possible bug in Spark Streaming :: TextFileStream

Posted by Tathagata Das <ta...@gmail.com>.
On second thought, I am not entirely sure whether that bug is the issue. Are
you continuously appending to the file that you have copied into the
directory? Because fileStream works correctly when the files are atomically
moved into the monitored directory.
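
For example, something along these lines, where the file is written outside
the watched directory first and then moved in with a single atomic rename
(both paths here are just illustrative):

import java.nio.file.{Files, Paths, StandardCopyOption}

// Write the file somewhere outside the monitored directory, then move it in
// atomically so the stream never sees a half-written file.
val staged  = Paths.get("/data/staging/part-0000.txt")
val watched = Paths.get("/data/incoming/part-0000.txt")
Files.move(staged, watched, StandardCopyOption.ATOMIC_MOVE)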

TD


On Mon, Jul 14, 2014 at 9:08 PM, Madabhattula Rajesh Kumar <
mrajaforu@gmail.com> wrote:

> Hi Team,
>
> Is this an issue with the JavaStreamingContext.textFileStream("hdfsfolderpath")
> API as well? Please confirm. If so, could you please help me fix this
> issue? I'm using Spark 1.0.0.
>
> Regards,
> Rajesh
>
>
> On Tue, Jul 15, 2014 at 5:42 AM, Tathagata Das <
> tathagata.das1565@gmail.com> wrote:
>
>> Oh yes, this was a bug and it has been fixed. Check out the master
>> branch!
>>
>>
>> https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC
>>
>> TD
>>
>>
>> On Mon, Jul 7, 2014 at 7:11 AM, Luis Ángel Vicente Sánchez <
>> langel.groups@gmail.com> wrote:
>>
>>> I have a basic Spark Streaming job that is watching a folder, processing
>>> any new file and updating a column family in Cassandra using the new
>>> cassandra-spark-driver.
>>>
>>> I think there is a problem with SparkStreamingContext.textFileStream...
>>> if I start my job in local mode with no files in the watched folder and
>>> then copy in a bunch of files, sometimes Spark keeps processing those
>>> files again and again.
>>>
>>> I have noticed that it usually happens when Spark doesn't detect all the
>>> new files in one go... e.g. I copied 6 files and Spark detected 3 of them
>>> as new and processed them; then it detected the other 3 as new and
>>> processed them. After it finished processing all 6 files, it detected the
>>> first 3 files as new again and processed them... then the other 3... and
>>> again... and again... and again.
>>>
>>> Should I raise a JIRA issue?
>>>
>>> Regards,
>>>
>>> Luis
>>>
>>
>>
>

Re: Possible bug in Spark Streaming :: TextFileStream

Posted by Madabhattula Rajesh Kumar <mr...@gmail.com>.
Hi Team,

Is this an issue with the JavaStreamingContext.textFileStream("hdfsfolderpath")
API as well? Please confirm. If so, could you please help me fix this
issue? I'm using Spark 1.0.0.

Regards,
Rajesh


On Tue, Jul 15, 2014 at 5:42 AM, Tathagata Das <ta...@gmail.com>
wrote:

> Oh yes, this was a bug and it has been fixed. Check out the master
> branch!
>
>
> https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC
>
> TD
>
>
> On Mon, Jul 7, 2014 at 7:11 AM, Luis Ángel Vicente Sánchez <
> langel.groups@gmail.com> wrote:
>
>> I have a basic Spark Streaming job that is watching a folder, processing
>> any new file and updating a column family in Cassandra using the new
>> cassandra-spark-driver.
>>
>> I think there is a problem with SparkStreamingContext.textFileStream...
>> if I start my job in local mode with no files in the watched folder and
>> then copy in a bunch of files, sometimes Spark keeps processing those
>> files again and again.
>>
>> I have noticed that it usually happens when Spark doesn't detect all the
>> new files in one go... e.g. I copied 6 files and Spark detected 3 of them
>> as new and processed them; then it detected the other 3 as new and
>> processed them. After it finished processing all 6 files, it detected the
>> first 3 files as new again and processed them... then the other 3... and
>> again... and again... and again.
>>
>> Should I raise a JIRA issue?
>>
>> Regards,
>>
>> Luis
>>
>
>

Re: Possible bug in Spark Streaming :: TextFileStream

Posted by Tathagata Das <ta...@gmail.com>.
Oh yes, this was a bug and it has been fixed. Check out the master
branch!

https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC

TD


On Mon, Jul 7, 2014 at 7:11 AM, Luis Ángel Vicente Sánchez <
langel.groups@gmail.com> wrote:

> I have a basic Spark Streaming job that is watching a folder, processing
> any new file and updating a column family in Cassandra using the new
> cassandra-spark-driver.
>
> I think there is a problem with SparkStreamingContext.textFileStream... if
> I start my job in local mode with no files in the watched folder and then
> copy in a bunch of files, sometimes Spark keeps processing those files
> again and again.
>
> I have noticed that it usually happens when Spark doesn't detect all the
> new files in one go... e.g. I copied 6 files and Spark detected 3 of them
> as new and processed them; then it detected the other 3 as new and
> processed them. After it finished processing all 6 files, it detected the
> first 3 files as new again and processed them... then the other 3... and
> again... and again... and again.
>
> Should I raise a JIRA issue?
>
> Regards,
>
> Luis
>