Posted to user@spark.apache.org by Tobias Pfeiffer <tg...@preferred.jp> on 2014/07/01 02:53:29 UTC

Re: Could not compute split, block not found

Bill,

let's say the processing time is t' and the window size is t. Spark does not
*require* t' < t. In fact, for *temporary* peaks in your streaming data, I
think the way Spark handles it is very nice, in particular because 1) it does
not mix up the order in which items arrived in the stream, so items from a
later window will always be processed later, and 2) an increase in data is
not punished with high load and an unresponsive system, but with disk space
consumption instead.

However, if all of your windows require t' > t processing time (not because
you are waiting, but because you are actually doing some computation), then
you are out of luck: if you start processing the next window while the
previous one is still being processed, you have fewer resources for each, and
processing will take even longer. If you are only waiting (e.g., for network
I/O), though, then maybe you can employ some asynchronous solution where your
tasks return immediately and deliver their result via a callback later?
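
(As a rough sketch of that last idea, assuming a DStream of String records
named `stream` and a made-up helper `sendToService` standing in for the slow
network call, something like the following Scala snippet illustrates the
fire-and-forget pattern; in a real job you would also want to bound the
number of outstanding requests:)

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Hypothetical slow network call; stands in for whatever I/O you do per record.
    def sendToService(record: String): String = ???

    records.foreach { record =>
      // The task returns immediately; the response is handled later in a callback.
      Future(sendToService(record)).onComplete {
        case Success(response) => println(s"got response: $response")
        case Failure(error)    => println(s"request failed: $error")
      }
    }
  }
}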

Tobias



On Tue, Jul 1, 2014 at 2:26 AM, Bill Jay <bi...@gmail.com> wrote:

> Tobias,
>
> Your suggestion is very helpful. I will definitely investigate it.
>
> Just curious. Suppose the batch size is t seconds. In practice, does Spark
> always require the program to finish processing t seconds of data within t
> seconds of processing time? Can Spark begin to consume the new batch before
> it finishes processing the previous batch? If Spark can do both at once, it
> may save processing time and solve the problem of data piling up.
>
> Thanks!
>
> Bill
>
>
>
>
> On Mon, Jun 30, 2014 at 4:49 AM, Tobias Pfeiffer <tg...@preferred.jp> wrote:
>
>> If your batch size is one minute and it takes more than one minute to
>> process, then I guess that's what causes your problem. The processing of
>> the second batch will not start until the processing of the first is
>> finished, which leads to more and more data being stored and waiting for
>> processing; check the attached graph for a visualization of what I think
>> may happen.
>>
>> Can you maybe do something hacky like throwing away a part of the data so
>> that processing time gets below one minute, then check whether you still
>> get that error?
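>>
>> (As a sketch, assuming the DStream is called `stream`, keeping roughly 10%
>> of the records in each batch could look like
>>
>>   // purely a diagnostic hack: process only ~10% of each batch
>>   val sampled = stream.transform(rdd => rdd.sample(withReplacement = false, fraction = 0.1))
>>
>> just to test whether the processing time drops below the batch interval.)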
>>
>> Tobias
>>
>>
>>
>>
>>
>> On Mon, Jun 30, 2014 at 1:56 PM, Bill Jay <bi...@gmail.com>
>> wrote:
>>
>>> Tobias,
>>>
>>> Thanks for your help. I think in my case, the batch size is 1 minute.
>>> However, it takes my program more than 1 minute to process 1 minute's
>>> data. I am not sure whether it is because the unprocessed data piles up.
>>> Do you have any suggestion on how to check and solve it? Thanks!
>>>
>>> Bill
>>>
>>>
>>> On Sun, Jun 29, 2014 at 7:18 PM, Tobias Pfeiffer <tg...@preferred.jp>
>>> wrote:
>>>
>>>> Bill,
>>>>
>>>> were you able to process all information in time, or did maybe some
>>>> unprocessed data pile up? When I saw this once, the reason seemed to be
>>>> that I had received more data than would fit in memory while it was
>>>> waiting for processing, so old data was deleted. When it was
>>>> time to process that data, it didn't exist any more. Is that a
>>>> possible reason in your case?
>>>>
>>>> Tobias
>>>>
>>>> On Sat, Jun 28, 2014 at 5:59 AM, Bill Jay <bi...@gmail.com>
>>>> wrote:
>>>> > Hi,
>>>> >
>>>> > I am running a Spark Streaming job with 1 minute as the batch size. It
>>>> > ran for around 84 minutes and was killed with the following exception:
>>>> >
>>>> > java.lang.Exception: Could not compute split, block
>>>> > input-0-1403893740400 not found
>>>> >
>>>> >
>>>> > Before it was killed, it was able to correctly generate output for
>>>> > each batch.
>>>> >
>>>> > Any help on this will be greatly appreciated.
>>>> >
>>>> > Bill
>>>> >
>>>>
>>>
>>>
>>
>

Re: Could not compute split, block not found

Posted by Bill Jay <bi...@gmail.com>.
Hi Tathagata,

Yes. The input stream is from Kafka, and my program reads the data, keeps
all of it in memory, processes it, and generates the output.

Bill


Re: Could not compute split, block not found

Posted by Tathagata Das <ta...@gmail.com>.
Are you by any chance using only memory in the storage level of the input
streams?
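
(If so, a common workaround is to give the receiver a storage level that can
spill to disk. Just as a sketch, for a receiver-based Kafka stream, with the
ZooKeeper quorum, group id, and topic map as placeholders:)

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("StreamingJob")
val ssc = new StreamingContext(sparkConf, Seconds(60))   // 1-minute batches

// Placeholder connection details; replace with your own.
val stream = KafkaUtils.createStream(
  ssc,
  "zk-host:2181",                        // ZooKeeper quorum
  "my-consumer-group",                   // Kafka consumer group id
  Map("my-topic" -> 1),                  // topic -> number of receiver threads
  StorageLevel.MEMORY_AND_DISK_SER_2)    // spill to disk instead of dropping blocks

That way a temporary backlog ends up on disk instead of pushing old,
not-yet-processed blocks out of memory.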

TD


Re: Could not compute split, block not found

Posted by Bill Jay <bi...@gmail.com>.
Hi Tobias,

Your explanation makes a lot of sense. Actually, I tried running the same
program on partial data yesterday. It has been up for around 24 hours and
is still running correctly. Thanks!

Bill

