Posted to user@spark.apache.org by Bryan Jeffrey <br...@gmail.com> on 2016/02/09 21:49:09 UTC

Spark Increase in Processing Time

All,

I am running the following versions:
- Spark 1.4.1
- Scala 2.11
- Kafka 0.8.2.1
- Spark Streaming

I am seeing my Spark Streaming job increase in processing time after it has
run for some period.
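
For context, a minimal sketch of how this kind of job is set up with the versions above. This assumes the Kafka direct stream API from spark-streaming-kafka; the broker list, topic, checkpoint directory, and batch interval are placeholders rather than our real values:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StreamingJobSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingJobSketch")
    val ssc = new StreamingContext(conf, Seconds(30))  // placeholder batch interval
    ssc.checkpoint("/tmp/streaming-checkpoint")        // checkpointing enabled (needed for windowed/inverse reductions)

    // Receiver-less (direct) Kafka stream; broker and topic are placeholders.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    // The real job does more than count, but this is enough to drive empty batches.
    messages.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}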

[image: Inline image 1]

If you look at the image above you can see the 'hockey stick' growth.  This
job is processing no input data (all batches have zero events). However,
after about 4-8 hours (in this case 6 hours) the processing time for each
job increases by around 20-30% (enough to push it past my batch interval).

The job does not have a single stage that grows in time - instead, all stages
grow.  Below is an example of one stage, run with the same set of tasks, that
is simply taking longer to complete over time.  I've labeled the two runs
'long' and 'short' respectively.

Has anyone seen this behavior? Does anyone have ideas on how to correct it?

Regards,

Bryan Jeffrey

Long Stage:
[image: Inline image 2]

Short Stage:
[image: Inline image 3]

Re: Spark Increase in Processing Time

Posted by Ted Yu <yu...@gmail.com>.
1.4.1 was released half a year ago.

I doubt there will be any more 1.4.x patch releases.

Please consider upgrading.


RE: Spark Increase in Processing Time

Posted by Bryan <br...@gmail.com>.
Ted,

We are using an inverse reduce function, but we do have a filter function in place to cull the key space.
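
To make that concrete, the windowed aggregation looks roughly like the sketch below; the key type, window/slide durations, partition count, and the particular filter are placeholders rather than the real values:

import org.apache.spark.streaming.{Minutes, Seconds}
import org.apache.spark.streaming.dstream.DStream

// counts is a DStream of per-key counts; String keys here are a placeholder.
def windowedCounts(counts: DStream[(String, Long)]): DStream[(String, Long)] =
  counts.reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,  // reduce: add values entering the window
    (a: Long, b: Long) => a - b,  // inverse reduce: subtract values leaving the window
    Minutes(10),                  // window duration (placeholder)
    Seconds(30),                  // slide duration (placeholder)
    numPartitions = 8,            // placeholder
    filterFunc = (kv: (String, Long)) => kv._2 > 0  // cull keys whose count has dropped to zero
  )

The inverse-reduce path requires checkpointing to be enabled on the StreamingContext.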

One thing I am considering is that this increase in processing time may be associated with the TTL expiration time (currently 6 hours). It may be a coincidence, however; all streams have zero data, so the RDD cleanup should be limited in scope. In the Storage tab I see 28 bytes retained in memory (all other persisted data has size 0).

I will try raising the TTL significantly and see whether that pushes the hockey stick out to a later point in time.
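
For reference, the TTL here is the standard metadata-cleaner setting; a sketch of what I mean by raising it (the 24-hour value is just the experiment, not a recommendation):

import org.apache.spark.SparkConf

// spark.cleaner.ttl is given in seconds: 21600 = 6 hours (current), 86400 = 24 hours (the experiment).
val conf = new SparkConf()
  .setAppName("StreamingJobSketch")
  .set("spark.cleaner.ttl", "86400")

The same setting can also be passed at submit time with --conf spark.cleaner.ttl=86400.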

Do you have other suggestions?


Sent from my Windows 10 phone

Re: Spark Increase in Processing Time

Posted by Ted Yu <yu...@gmail.com>.
Have you seen this thread ?
http://search-hadoop.com/m/q3RTtM6WWs1yUHch2&subj=Re+Spark+streaming+Processing+time+keeps+increasing
