Posted to user@spark.apache.org by jlg <jg...@adzerk.com> on 2015/08/19 16:51:18 UTC

Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

Some background on what we're trying to do:

We have four Kinesis receivers with varying amounts of data coming through
them. Ultimately we work on a unioned stream that is getting about 11
MB/second of data. We use a batch interval of 5 seconds.
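
Roughly, the receiver setup looks like the sketch below (assuming the Spark
1.4+ KinesisUtils API; the app/stream names, endpoint, region, and storage
level are placeholders for our actual values):

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val sparkConf = new SparkConf().setAppName("aggregation-app")
val ssc = new StreamingContext(sparkConf, Seconds(5))  // 5-second batch interval

// Four receivers on the same Kinesis stream, unioned into one DStream.
val kinesisStreams = (1 to 4).map { _ =>
  KinesisUtils.createStream(
    ssc, "aggregation-app", "events-stream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, Seconds(5),  // Kinesis checkpoint interval
    StorageLevel.MEMORY_ONLY_SER_2)              // placeholder storage level
}
val unioned = ssc.union(kinesisStreams)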

We create four distinct DStreams from this data, each with a different
aggregation computation (various combinations of
map/flatMap/reduceByKeyAndWindow, finishing by serializing the records to
JSON strings and writing them to S3). We want to do 30-minute windows of
computation on this data to get a better compression rate for the
aggregates (there are a lot of repeated keys across this time frame, and we
want to combine them all -- we do this using reduceByKeyAndWindow).
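
One of those aggregations looks roughly like the sketch below (the key
extraction and JSON serialization are hypothetical stand-ins for our real
code, the 5-minute slide interval is illustrative, and the S3 path is a
placeholder):

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

// Count occurrences per key over a 30-minute window (sliding every 5
// minutes for illustration), serialize each (key, count) pair to JSON,
// and write the batch out to S3. parseKey and toJson are hypothetical
// helpers standing in for the real extraction/serialization logic.
def windowedAggregate(records: DStream[Array[Byte]],
                      parseKey: Array[Byte] => String,
                      toJson: ((String, Long)) => String): Unit = {
  records
    .map(bytes => (parseKey(bytes), 1L))
    .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(30), Minutes(5))
    .map(toJson)
    .foreachRDD { rdd =>
      rdd.saveAsTextFile(s"s3n://my-bucket/aggregates/${System.currentTimeMillis}")
    }
}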

But even when trying to do 5-minute windows, we have issues with "Could not
compute split, block —— not found". This is running on a YARN cluster, and
it seems like the executors are getting killed even though they should have
plenty of memory.

Also, it seems like no computation actually takes place until the end of the
window duration. This seems inefficient if there is a lot of data that you
know is going to be needed for the computation. Is there any good way around
this?

These are some of the configuration settings we are using for Spark:

spark.executor.memory=26000M,\
spark.executor.cores=4,\
spark.executor.instances=5,\
spark.driver.cores=4,\
spark.driver.memory=24000M,\
spark.default.parallelism=128,\
spark.streaming.blockInterval=100ms,\
spark.streaming.receiver.maxRate=20000,\
spark.akka.timeout=300,\
spark.storage.memoryFraction=0.6,\
spark.rdd.compress=true,\
spark.executor.instances=16,\
spark.serializer=org.apache.spark.serializer.KryoSerializer,\
spark.kryoserializer.buffer.max=2047m,\
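
In programmatic form these map onto SparkConf entries; a partial sketch
(in practice we supply them at submit time rather than hard-coding them):

import org.apache.spark.SparkConf

// A subset of the settings above expressed as SparkConf entries.
val conf = new SparkConf()
  .set("spark.default.parallelism", "128")
  .set("spark.streaming.blockInterval", "100ms")
  .set("spark.streaming.receiver.maxRate", "20000")
  .set("spark.rdd.compress", "true")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "2047m")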


Is this the correct way to do this, and how can I further debug to figure
out this issue? 





Re: Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
You hit block-not-found issues when your processing time exceeds the batch
duration (this happens with receiver-oriented streaming). If you are
consuming messages from Kafka, try the directStream instead, or you can set
the StorageLevel to MEMORY_AND_DISK with a receiver-oriented consumer
(this might slow things down a bit though).
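
For the Kinesis receivers described above, the storage level is passed when
creating the stream; a minimal sketch against the Spark 1.4+ KinesisUtils
API (app/stream names, endpoint, and region are placeholders):

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

// Same receiver setup, but received blocks spill to disk instead of being
// dropped when executor memory runs low (use MEMORY_AND_DISK_2 if you also
// want replication).
def createKinesisStream(ssc: StreamingContext) =
  KinesisUtils.createStream(
    ssc, "aggregation-app", "events-stream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, Seconds(5),
    StorageLevel.MEMORY_AND_DISK)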

Thanks
Best Regards
