Posted to user@spark.apache.org by "Afshartous, Nick" <na...@turbine.com> on 2015/10/28 16:45:15 UTC

Spark/Kafka Streaming Job Gets Stuck

Hi, we are load testing our Spark 1.3 streaming job (reading from Kafka) and seeing a problem.  The job runs on AWS/YARN with a streaming batch interval of 3 minutes on a ten-node cluster.

Testing at 30,000 events per second, we are seeing the streaming job get stuck (stack trace below) for over an hour.

Thanks for any insights or suggestions.
--
      Nick

org.apache.spark.streaming.api.java.AbstractJavaDStreamLike.mapPartitionsToPair(JavaDStreamLike.scala:43)
com.wb.analytics.spark.services.streaming.drivers.StreamingKafkaConsumerDriver.runStream(StreamingKafkaConsumerDriver.java:125)
com.wb.analytics.spark.services.streaming.drivers.StreamingKafkaConsumerDriver.main(StreamingKafkaConsumerDriver.java:71)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:480)




Re: Spark/Kafka Streaming Job Gets Stuck

Posted by Adrian Tanase <at...@adobe.com>.
You can decouple the batch interval and the window sizes. If during processing you’re aggregating data and your operations benefit from an inverse function, then you can process windows of data efficiently.

E.g. you could set a global batch interval of 10 seconds and aggregate the incoming data from Kafka as it arrives.
Then you can create a window of 3 minutes (both length and slide) over the partial results. In this case the inverse function is not helpful, since all the data is new in every window.

You can even coalesce the final DStream to avoid writing many small files. For example, you could write fewer files, more often, and achieve a similar effect.
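
To make that concrete, here is a minimal sketch in Scala against the Spark 1.3 direct Kafka API. It assumes a simple count-by-key aggregation; the broker address, topic name, and S3 path are placeholders, not your actual pipeline.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("windowed-kafka-writer")
    val ssc = new StreamingContext(conf, Seconds(10))   // small global batch interval

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")    // placeholder
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                                 // placeholder topic

    // Aggregate each 10-second batch as it arrives (here: count by key).
    val partials = stream.map { case (key, _) => (key, 1L) }.reduceByKey(_ + _)

    // Tumbling 3-minute window (length == slide), so an inverse function buys nothing.
    val windowed = partials.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(3), Minutes(3))

    // Coalesce before writing so each window produces a few larger files.
    windowed.foreachRDD { rdd =>
      rdd.coalesce(4).saveAsTextFile("s3n://bucket/prefix/" + System.currentTimeMillis)
    }

    ssc.start()
    ssc.awaitTermination()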

All of this is of course hypothetical since I don’t know what processing you are applying to the data coming from Kafka. More like food for thought.

-adrian







Re: Spark/Kafka Streaming Job Gets Stuck

Posted by Cody Koeninger <co...@koeninger.org>.
If you're writing to S3, want to avoid small files, and don't actually need 3 minute latency... you may want to consider just running a regular Spark job (using KafkaUtils.createRDD) at scheduled intervals rather than a streaming job.
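
For reference, a rough sketch of that approach in Scala against the Spark 1.3 API. Broker, topic, offsets, and output path are placeholders; a real job would persist the consumed offsets between runs (e.g. in ZooKeeper or a database) instead of hard-coding them.

    import kafka.serializer.StringDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-to-s3"))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")    // placeholder

    // One range per topic-partition; the offsets here are made up and would
    // come from wherever the previous run checkpointed its position.
    val offsetRanges = Array(OffsetRange("events", 0, 1000000L, 2000000L))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    // Each scheduled run (cron, etc.) writes one batch of reasonably sized files.
    rdd.map(_._2).coalesce(8).saveAsTextFile("s3n://bucket/batch/" + System.currentTimeMillis)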


RE: Spark/Kafka Streaming Job Gets Stuck

Posted by Sabarish Sasidharan <sa...@manthan.com>.
If you are writing to S3, also make sure that you are using the direct
output committer. I don't have streaming jobs, but it helps in my machine
learning jobs. Also, though more partitions help you process faster, they
do slow down writes to S3, so you might want to coalesce before writing to
S3.
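
As a sketch only: the committer is set through the Hadoop job configuration, and the exact class depends on your distribution (EMR ships a direct committer; otherwise you supply your own), so the class name below is an assumption.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("s3-direct-write")
      // Assumed class name: a direct committer writes straight to the final S3
      // location instead of renaming files out of a temporary directory.
      .set("spark.hadoop.mapred.output.committer.class",
           "org.apache.hadoop.mapred.DirectFileOutputCommitter")
    val sc = new SparkContext(conf)

    // Fewer partitions before the write means fewer, larger S3 objects.
    val data = sc.parallelize(1 to 1000000).map(_.toString)
    data.coalesce(8).saveAsTextFile("s3n://bucket/output")    // placeholder path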

Regards
Sab

RE: Spark/Kafka Streaming Job Gets Stuck

Posted by "Afshartous, Nick" <na...@turbine.com>.
< Does it work as expected with smaller batch or smaller load? Could it be that it's accumulating too many events over 3 minutes?

Thanks for your input.  The 3-minute window was chosen because we write the output of each batch to S3, and with smaller batch intervals many small files were being written to S3, something to avoid.  That was the explanation of the developer who made this decision (who's no longer on the team).  We're in the process of re-evaluating.
--
     Nick



Re: Spark/Kafka Streaming Job Gets Stuck

Posted by srungarapu vamsi <sr...@gmail.com>.
Other than @Adrian's suggestions, check whether the batch processing time exceeds
the batch interval; if it does, batches queue up and the scheduling delay keeps growing.



-- 
/Vamsi

Re: Spark/Kafka Streaming Job Gets Stuck

Posted by Adrian Tanase <at...@adobe.com>.
Does it work as expected with smaller batch or smaller load? Could it be that it's accumulating too many events over 3 minutes?

You could also try increasing the parallelism via repartition to ensure smaller tasks that can safely fit in working memory.
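
A minimal sketch of that in Scala; the broker and topic are placeholders, and the multiplier is an arbitrary starting point to tune empirically.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("repartition-sketch"), Seconds(180))
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))   // placeholders

    // Spread each 3-minute batch over more, smaller tasks so each task's
    // working set fits comfortably in executor memory.
    val numPartitions = ssc.sparkContext.defaultParallelism * 4   // tune empirically
    val repartitioned = stream.repartition(numPartitions)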

Sent from my iPhone
