Posted to dev@samza.apache.org by David Yu <da...@optimizely.com> on 2016/10/06 16:30:57 UTC

RecordTooLargeException recovery

Hi,

Our Samza job (0.10.1) throws RecordTooLargeExceptions when flushing KV
store changes to the changelog topic, as well as when sending outputs to
Kafka. We have two questions about this problem:

1. It seems that after the affected containers failed multiple times, the
job was able to recover and move on. This is a bit hard to understand. How
could this be recoverable? We were glad it actually did, but are
uncomfortable not knowing the reason behind it.
2. What would be the best way to prevent this from happening? Since Samza
serde happens behind the scenes, there does not seem to be a good way to
find out the payload size in bytes before putting a value into the KV store.
Any suggestions on this? (A sketch of the kind of manual check we have in
mind is below.)
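
For illustration, a manual check would have to re-serialize the value with
the same serde the store uses, roughly like the hypothetical wrapper below
(only the Serde and KeyValueStore interfaces are standard Samza; the class
name, threshold, and error handling are made up):

import org.apache.samza.serializers.Serde;
import org.apache.samza.storage.kv.KeyValueStore;

// Hypothetical wrapper, not part of Samza: measures the serialized size of a
// value before handing it to the store, at the cost of serializing it twice
// (once here, once again inside the store).
public class SizeCheckedStore<K, V> {
  private static final int MAX_VALUE_BYTES = 900 * 1024; // stay under a ~1 MB broker limit

  private final KeyValueStore<K, V> store;
  private final Serde<V> valueSerde; // must match the serde configured for the store

  public SizeCheckedStore(KeyValueStore<K, V> store, Serde<V> valueSerde) {
    this.store = store;
    this.valueSerde = valueSerde;
  }

  public void put(K key, V value) {
    byte[] bytes = valueSerde.toBytes(value);
    if (bytes.length > MAX_VALUE_BYTES) {
      // Fail fast here; splitting or truncating the value would be
      // alternatives depending on the use case.
      throw new IllegalArgumentException(
          "Value for key " + key + " is " + bytes.length + " bytes, over the limit");
    }
    store.put(key, value);
  }
}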

Thanks,
David

Re: RecordTooLargeException recovery

Posted by David Yu <da...@optimizely.com>.
Xinyu,

Thanks for the answers. Those suggestions are helpful as well.

David

On Thu, Oct 6, 2016 at 12:48 PM xinyu liu <xi...@gmail.com> wrote:

> Hi, David,
>
> For your questions:
>
> 1) In this case Samza recovered, but the changelog message was lost. In
> 0.10.1 KafkaSystemProducer has a race condition: there is a small chance
> that a later successful send overrides a previous failure. The bug is fixed
> in the upcoming 0.11.0 release (SAMZA-1019). The fix lets you catch the
> exception and then decide whether to ignore or rethrow it. In the latter
> case the container will fail, and Samza guarantees the message will be
> reprocessed after it restarts.
>
> 2) There are several things that might help in your case. First, you can
> turn on compression for your checkpoint stream; that usually saves about
> 20%-30%. Second, you can bump up max.request.size for the producer; in that
> case you need to make sure the broker is also configured with a
> corresponding max message size. Last, you might also try to split the key
> into subkeys so that each value will be smaller.
>
> Thanks,
> Xinyu
>
> On Thu, Oct 6, 2016 at 9:30 AM, David Yu <da...@optimizely.com> wrote:
>
> > Hi,
> >
> > Our Samza job (0.10.1) throws RecordTooLargeExceptions when flushing KV
> > store changes to the changelog topic, as well as when sending outputs to
> > Kafka. We have two questions about this problem:
> >
> > 1. It seems that after the affected containers failed multiple times, the
> > job was able to recover and move on. This is a bit hard to understand.
> > How could this be recoverable? We were glad it actually did, but are
> > uncomfortable not knowing the reason behind it.
> > 2. What would be the best way to prevent this from happening? Since Samza
> > serde happens behind the scenes, there does not seem to be a good way to
> > find out the payload size in bytes before putting a value into the KV
> > store. Any suggestions on this?
> >
> > Thanks,
> > David
> >
>

Re: RecordTooLargeException recovery

Posted by xinyu liu <xi...@gmail.com>.
Hi, David,

For your questions:

1) In this case Samza recovered, but the changelog message was lost. In
0.10.1 KafkaSystemProducer has a race condition: there is a small chance
that a later successful send overrides a previous failure. The bug is fixed
in the upcoming 0.11.0 release (SAMZA-1019). The fix lets you catch the
exception and then decide whether to ignore or rethrow it. In the latter
case the container will fail, and Samza guarantees the message will be
reprocessed after it restarts.

2) There are several things that might help in your case. First, you can
turn on compression for your checkpoint stream; that usually saves about
20%-30%. Second, you can bump up max.request.size for the producer; in that
case you need to make sure the broker is also configured with a
corresponding max message size. Last, you might also try to split the key
into subkeys so that each value will be smaller.
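
For example, assuming your Kafka system is named "kafka" in the Samza job
config, the producer overrides would look roughly like this (the sizes are
just placeholders; the broker and topic limits shown are the standard Kafka
settings):

systems.kafka.producer.compression.type=gzip
systems.kafka.producer.max.request.size=5242880

# The broker (or the topic, e.g. the changelog topic) must allow at least
# the same size:
# message.max.bytes=5242880      (broker-wide)
# max.message.bytes=5242880      (per-topic override)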

Thanks,
Xinyu

On Thu, Oct 6, 2016 at 9:30 AM, David Yu <da...@optimizely.com> wrote:

> Hi,
>
> Our Samza job (0.10.1) throws RecordTooLargeExceptions when flushing KV
> store changes to the changelog topic, as well as when sending outputs to
> Kafka. We have two questions about this problem:
>
> 1. It seems that after the affected containers failed multiple times, the
> job was able to recover and move on. This is a bit hard to understand. How
> could this be recoverable? We were glad it actually did, but are
> uncomfortable not knowing the reason behind it.
> 2. What would be the best way to prevent this from happening? Since Samza
> serde happens behind the scenes, there does not seem to be a good way to
> find out the payload size in bytes before putting a value into the KV
> store. Any suggestions on this?
>
> Thanks,
> David
>