Posted to dev@kafka.apache.org by Gwen Shapira <gs...@cloudera.com> on 2015/02/11 01:32:14 UTC

[DISCUSSION] KAFKA-1697 - make NotEnoughReplicasAfterAppend a non-retriable exception

Hi Kafka Devs,

Need your thoughts on retriable exceptions:

If a user configures Kafka with min.isr > 1 and there are not enough
replicas to safely store the data, there are two possibilities:

1. The lack of replicas was discovered before the message was written. We
throw NotEnoughReplicas.
2. The lack of replicas was discovered after the message was written to
the leader. In this case, we throw NotEnoughReplicasAfterAppend.

Currently, both errors are Retriable, which means that the new producer
will retry multiple times.
In the case of the second exception, this will cause duplicates.
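
For concreteness, here is a minimal sketch of the setup (broker address,
topic and class name are made up; it assumes the new producer's Java API,
so treat it as an illustration rather than a tested snippet):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.Callback;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class MinIsrDemo {
        public static void main(String[] args) {
            // The topic is assumed created with min.insync.replicas=2.
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // made-up address
            props.put("acks", "all");  // fail the send if min.isr is not met
            props.put("retries", "3"); // retry errors marked Retriable
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            Producer<String, String> producer = new KafkaProducer<>(props);
            producer.send(new ProducerRecord<>("test-topic", "key", "value"),
                new Callback() {
                    public void onCompletion(RecordMetadata md, Exception e) {
                        // Case 1 surfaces here as NotEnoughReplicasException
                        // (nothing was written); case 2 as
                        // NotEnoughReplicasAfterAppendException (the leader
                        // already has the message, so a retry that succeeds
                        // appends it a second time).
                        if (e != null)
                            System.err.println("send failed: " + e);
                    }
                });
            producer.close();
        }
    }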

KAFKA-1697 suggests:
"we probably want to make NotEnoughReplicasAfterAppend a non-retriable
exception and let the client decide what to do."

I agreed that the client (the one using the Producer) should weigh the
problems duplicates will cause against the probability of losing the
message and do something sensible, so I made the exception non-retriable.

In the RB (https://reviews.apache.org/r/29647/) Joel raised a good point:
(Joel, feel free to correct me if I misrepresented your point)

"I think our interpretation of retriable is as follows (but we can discuss
on the list if that needs to change): if the produce request hits an error,
and there is absolutely no point in retrying then that is a non-retriable
error. MessageSizeTooLarge is an example - since unless the producer
changes the request to make the messages smaller there is no point in
retrying.

...
Duplicates can arise even for other errors (e.g., request timed out). So
that side-effect is not compelling enough to warrant a change to make this
non-retriable. "

(TL;DR) Should exceptions where retries can cause duplicates still be
retriable?

Gwen

Re: [DISCUSSION] KAFKA-1697 - make NotEnoughReplicasAfterAppend a non-retriable exception

Posted by Gwen Shapira <gs...@cloudera.com>.
Makes sense to me.

Thanks for the very detailed clarification, Jay :)

Will leave NotEnoughReplicasAfterAppend as retriable.

Gwen

On Wed, Feb 11, 2015 at 4:18 PM, Guozhang Wang <wa...@gmail.com> wrote:

> If people have agreed upon this semantic:
>
> <quote>
> if you set retries > 0 you are saying "I accept
> duplicates but want to ensure my stuff gets written", if you set retries =
> 0 you are saying "I can't abide duplicates and am willing to tolerate
> loss". So Retryable for us means "retry may succeed".
> </quote>
>
> then NotEnoughReplicasAfterAppend should be retriable.
>
> PS: we can probably make it clearer in the new producer config table?
>
> Guozhang
>
> On Wed, Feb 11, 2015 at 5:41 AM, Joel Koshy <jj...@gmail.com> wrote:
>
> > Thanks for the comments - however, it is not clear to me what your
> > preference is on making NotEnoughReplicasAfterAppend retriable vs
> > non-retriable.
> >
> > As for me, my preference is to leave it as retriable since it is clear
> > that the produce may succeed on a retry (and may introduce a
> > duplicate). I agree that idempotence will bring full closure to this
> > though.
> >
> > Anyone else have a preference on this?
> >
> > Thanks,
> >
> > Joel
> >
> > On Tue, Feb 10, 2015 at 08:23:08PM -0800, Jay Kreps wrote:
> > > Yeah there are really two concepts here as I think you noted:
> > > 1. Retry safe: we know that the write did not occur
> > > 2. Retry fixable: if you send that again it could work
> > >
> > > (probably there are better names for these).
> > >
> > > Some things we know did not do a write and may be fixed by retrying
> > > (no leader). Some things we know didn't do a write and are not worth
> > > retrying (message too large). Some things we don't know and are worth
> > > retrying (network error), and probably some things we don't know and
> > > aren't worth it (can't think of one though).
> > >
> > > (I feel like Donald Rumsfeld with the "known unknowns" thing).
> > >
> > > In the current world if you set retries > 0 you are saying "I accept
> > > duplicates but want to ensure my stuff gets written", if you set
> > > retries = 0 you are saying "I can't abide duplicates and am willing
> > > to tolerate loss". So Retryable for us means "retry may succeed".
> > >
> > > Originally I thought of maybe trying to model both concepts. However
> > > the two arguments against it are:
> > > 1. Even if you do this the guarantee remains "at least once delivery"
> > > because: (1) in the network error case you just don't know, (2)
> > > consumer failure.
> > > 2. The proper fix for this is to add idempotence support on the
> > > server, which we should do.
> > >
> > > Doing idempotence support on the server will actually fix all
> > > duplicate problems, including the network error case (because of
> > > course the server knows whether your write went through even though
> > > the client doesn't). When we have that then the client can always
> > > just retry anything marked Retriable (i.e. retry may work) without
> > > fear of duplicates.
> > >
> > > This gives exactly once delivery to the log, and a co-operating
> > > consumer can use the offset to dedupe and get it end-to-end.
> > >
> > > So that was why I had just left one type of Retriable and used it to
> > > mean "retry may work" and didn't try to flag anything for duplicates.
> > >
> > > -Jay
> > >
> > >
> > >
> > >
> > > On Tue, Feb 10, 2015 at 4:32 PM, Gwen Shapira <gs...@cloudera.com> wrote:
> > >
> > > > Hi Kafka Devs,
> > > >
> > > > Need your thoughts on retriable exceptions:
> > > >
> > > > If a user configures Kafka with min.isr > 1 and there are not enough
> > > > replicas to safely store the data, there are two possibilities:
> > > >
> > > > 1. The lack of replicas was discovered before the message was
> > > > written. We throw NotEnoughReplicas.
> > > > 2. The lack of replicas was discovered after the message was written
> > > > to the leader. In this case, we throw NotEnoughReplicasAfterAppend.
> > > >
> > > > Currently, both errors are Retriable, which means that the new
> > > > producer will retry multiple times.
> > > > In the case of the second exception, this will cause duplicates.
> > > >
> > > > KAFKA-1697 suggests:
> > > > "we probably want to make NotEnoughReplicasAfterAppend a
> > > > non-retriable exception and let the client decide what to do."
> > > >
> > > > I agreed that the client (the one using the Producer) should weigh
> > > > the problems duplicates will cause against the probability of losing
> > > > the message and do something sensible, so I made the exception
> > > > non-retriable.
> > > >
> > > > In the RB (https://reviews.apache.org/r/29647/) Joel raised a good
> > > > point:
> > > > (Joel, feel free to correct me if I misrepresented your point)
> > > >
> > > > "I think our interpretation of retriable is as follows (but we can
> > > > discuss on the list if that needs to change): if the produce request
> > > > hits an error, and there is absolutely no point in retrying then
> > > > that is a non-retriable error. MessageSizeTooLarge is an example -
> > > > since unless the producer changes the request to make the messages
> > > > smaller there is no point in retrying.
> > > >
> > > > ...
> > > > Duplicates can arise even for other errors (e.g., request timed
> > > > out). So that side-effect is not compelling enough to warrant a
> > > > change to make this non-retriable."
> > > >
> > > > (TL;DR) Should exceptions where retries can cause duplicates still
> > > > be retriable?
> > > >
> > > > Gwen
> > > >
> >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSSION] KAFKA-1697 - make NotEnoughReplicasAfterAppend a non-retriable exception

Posted by Guozhang Wang <wa...@gmail.com>.
If people have agreed upon this semantic:

<quote>
if you set retries > 0 you are saying "I accept
duplicates but want to ensure my stuff gets written", if you set retries =
0 you are saying "I can't abide duplicates and am willing to tolerate
loss". So Retryable for us means "retry may succeed".
</quote>

then NotEnoughReplicasAfterAppend should be retriable.
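
As a sketch, the two contracts in new-producer config terms (nothing new
here, just the quote made concrete):

    // "I accept duplicates but want to ensure my stuff gets written":
    Properties atLeastOnce = new Properties();
    atLeastOnce.put("retries", "3");

    // "I can't abide duplicates and am willing to tolerate loss":
    Properties atMostOnce = new Properties();
    atMostOnce.put("retries", "0");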

PS: we can probably make it clearer in the new producer config table?

Guozhang

On Wed, Feb 11, 2015 at 5:41 AM, Joel Koshy <jj...@gmail.com> wrote:

> Thanks for the comments - however, it is not clear to me what your
> preference is on making NotEnoughReplicasAfterAppend retriable vs
> non-retriable.
>
> As for me, my preference is to leave it as retriable since it is clear
> that the produce may succeed on a retry (and may introduce a
> duplicate). I agree that idempotence will bring full closure to this
> though.
>
> Anyone else have a preference on this?
>
> Thanks,
>
> Joel
>
> On Tue, Feb 10, 2015 at 08:23:08PM -0800, Jay Kreps wrote:
> > Yeah there are really two concepts here as I think you noted:
> > 1. Retry safe: we know that the write did not occur
> > 2. Retry fixable: if you send that again it could work
> >
> > (probably there are better names for these).
> >
> > Some things we know did not do a write and may be fixed by retrying
> > (no leader). Some things we know didn't do a write and are not worth
> > retrying (message too large). Some things we don't know and are worth
> > retrying (network error), and probably some things we don't know and
> > aren't worth it (can't think of one though).
> >
> > (I feel like Donald Rumsfeld with the "known unknowns" thing).
> >
> > In the current world if you set retries > 0 you are saying "I accept
> > duplicates but want to ensure my stuff gets written", if you set
> > retries = 0 you are saying "I can't abide duplicates and am willing
> > to tolerate loss". So Retryable for us means "retry may succeed".
> >
> > Originally I thought of maybe trying to model both concepts. However
> > the two arguments against it are:
> > 1. Even if you do this the guarantee remains "at least once delivery"
> > because: (1) in the network error case you just don't know, (2)
> > consumer failure.
> > 2. The proper fix for this is to add idempotence support on the
> > server, which we should do.
> >
> > Doing idempotence support on the server will actually fix all
> > duplicate problems, including the network error case (because of
> > course the server knows whether your write went through even though
> > the client doesn't). When we have that then the client can always
> > just retry anything marked Retriable (i.e. retry may work) without
> > fear of duplicates.
> >
> > This gives exactly once delivery to the log, and a co-operating
> > consumer can use the offset to dedupe and get it end-to-end.
> >
> > So that was why I had just left one type of Retriable and used it to
> > mean "retry may work" and didn't try to flag anything for duplicates.
> >
> > -Jay
> >
> >
> >
> >
> > On Tue, Feb 10, 2015 at 4:32 PM, Gwen Shapira <gs...@cloudera.com> wrote:
> >
> > > Hi Kafka Devs,
> > >
> > > Need your thoughts on retriable exceptions:
> > >
> > > If a user configures Kafka with min.isr > 1 and there are not enough
> > > replicas to safely store the data, there are two possibilities:
> > >
> > > 1. The lack of replicas was discovered before the message was
> > > written. We throw NotEnoughReplicas.
> > > 2. The lack of replicas was discovered after the message was written
> > > to the leader. In this case, we throw NotEnoughReplicasAfterAppend.
> > >
> > > Currently, both errors are Retriable, which means that the new
> > > producer will retry multiple times.
> > > In the case of the second exception, this will cause duplicates.
> > >
> > > KAFKA-1697 suggests:
> > > "we probably want to make NotEnoughReplicasAfterAppend a
> > > non-retriable exception and let the client decide what to do."
> > >
> > > I agreed that the client (the one using the Producer) should weigh
> > > the problems duplicates will cause against the probability of losing
> > > the message and do something sensible, so I made the exception
> > > non-retriable.
> > >
> > > In the RB (https://reviews.apache.org/r/29647/) Joel raised a good
> > > point:
> > > (Joel, feel free to correct me if I misrepresented your point)
> > >
> > > "I think our interpretation of retriable is as follows (but we can
> > > discuss on the list if that needs to change): if the produce request
> > > hits an error, and there is absolutely no point in retrying then
> > > that is a non-retriable error. MessageSizeTooLarge is an example -
> > > since unless the producer changes the request to make the messages
> > > smaller there is no point in retrying.
> > >
> > > ...
> > > Duplicates can arise even for other errors (e.g., request timed
> > > out). So that side-effect is not compelling enough to warrant a
> > > change to make this non-retriable."
> > >
> > > (TL;DR) Should exceptions where retries can cause duplicates still
> > > be retriable?
> > >
> > > Gwen
> > >
>
>


-- 
-- Guozhang

Re: [DISCUSSION] KAFKA-1697 - make NotEnoughReplicasAfterAppend a non-retriable exception

Posted by Neha Narkhede <ne...@confluent.io>.
>
> As for me, my preference is to leave it as retriable since it is clear
> that the produce may succeed on a retry (and may introduce a
> duplicate). I agree that idempotence will bring full closure to this
> though.


+1

On Wed, Feb 11, 2015 at 5:24 PM, Jay Kreps <ja...@gmail.com> wrote:

> Yeah, hey Joel, that was a super long-winded way of saying let's leave it
> at Retriable. I agree there is another concept which is "non-duplicate-producing"
> but when we do the idempotence stuff then all things will have
> that property so it may be okay just to leave it for now since network
> errors make it impossible to really do this properly without idempotence.
>
> -Jay
>
> On Wed, Feb 11, 2015 at 5:41 AM, Joel Koshy <jj...@gmail.com> wrote:
>
> > Thanks for the comments - however, it is not clear to me what your
> > preference is on making NotEnoughReplicasAfterAppend retriable vs
> > non-retriable.
> >
> > As for me, my preference is to leave it as retriable since it is clear
> > that the produce may succeed on a retry (and may introduce a
> > duplicate). I agree that idempotence will bring full closure to this
> > though.
> >
> > Anyone else have a preference on this?
> >
> > Thanks,
> >
> > Joel
> >
> > On Tue, Feb 10, 2015 at 08:23:08PM -0800, Jay Kreps wrote:
> > > Yeah there are really two concepts here as I think you noted:
> > > 1. Retry safe: we know that the write did not occur
> > > 2. Retry fixable: if you send that again it could work
> > >
> > > (probably there are better names for these).
> > >
> > > Some things we know did not do a write and may be fixed by retrying
> > > (no leader). Some things we know didn't do a write and are not worth
> > > retrying (message too large). Some things we don't know and are worth
> > > retrying (network error), and probably some things we don't know and
> > > aren't worth it (can't think of one though).
> > >
> > > (I feel like Donald Rumsfeld with the "known unknowns" thing).
> > >
> > > In the current world if you set retries > 0 you are saying "I accept
> > > duplicates but want to ensure my stuff gets written", if you set
> > > retries = 0 you are saying "I can't abide duplicates and am willing
> > > to tolerate loss". So Retryable for us means "retry may succeed".
> > >
> > > Originally I thought of maybe trying to model both concepts. However
> > > the two arguments against it are:
> > > 1. Even if you do this the guarantee remains "at least once delivery"
> > > because: (1) in the network error case you just don't know, (2)
> > > consumer failure.
> > > 2. The proper fix for this is to add idempotence support on the
> > > server, which we should do.
> > >
> > > Doing idempotence support on the server will actually fix all
> > > duplicate problems, including the network error case (because of
> > > course the server knows whether your write went through even though
> > > the client doesn't). When we have that then the client can always
> > > just retry anything marked Retriable (i.e. retry may work) without
> > > fear of duplicates.
> > >
> > > This gives exactly once delivery to the log, and a co-operating
> > > consumer can use the offset to dedupe and get it end-to-end.
> > >
> > > So that was why I had just left one type of Retriable and used it to
> > > mean "retry may work" and didn't try to flag anything for duplicates.
> > >
> > > -Jay
> > >
> > >
> > >
> > >
> > > On Tue, Feb 10, 2015 at 4:32 PM, Gwen Shapira <gs...@cloudera.com> wrote:
> > >
> > > > Hi Kafka Devs,
> > > >
> > > > Need your thoughts on retriable exceptions:
> > > >
> > > > If a user configures Kafka with min.isr > 1 and there are not enough
> > > > replicas to safely store the data, there are two possibilities:
> > > >
> > > > 1. The lack of replicas was discovered before the message was
> > > > written. We throw NotEnoughReplicas.
> > > > 2. The lack of replicas was discovered after the message was written
> > > > to the leader. In this case, we throw NotEnoughReplicasAfterAppend.
> > > >
> > > > Currently, both errors are Retriable, which means that the new
> > > > producer will retry multiple times.
> > > > In the case of the second exception, this will cause duplicates.
> > > >
> > > > KAFKA-1697 suggests:
> > > > "we probably want to make NotEnoughReplicasAfterAppend a
> > > > non-retriable exception and let the client decide what to do."
> > > >
> > > > I agreed that the client (the one using the Producer) should weigh
> > > > the problems duplicates will cause against the probability of losing
> > > > the message and do something sensible, so I made the exception
> > > > non-retriable.
> > > >
> > > > In the RB (https://reviews.apache.org/r/29647/) Joel raised a good
> > > > point:
> > > > (Joel, feel free to correct me if I misrepresented your point)
> > > >
> > > > "I think our interpretation of retriable is as follows (but we can
> > > > discuss on the list if that needs to change): if the produce request
> > > > hits an error, and there is absolutely no point in retrying then
> > > > that is a non-retriable error. MessageSizeTooLarge is an example -
> > > > since unless the producer changes the request to make the messages
> > > > smaller there is no point in retrying.
> > > >
> > > > ...
> > > > Duplicates can arise even for other errors (e.g., request timed
> > > > out). So that side-effect is not compelling enough to warrant a
> > > > change to make this non-retriable."
> > > >
> > > > (TL;DR) Should exceptions where retries can cause duplicates still
> > > > be retriable?
> > > >
> > > > Gwen
> > > >
> >
> >
>



-- 
Thanks,
Neha

Re: [DISCUSSION] KAFKA-1697 - make NotEnoughReplicasAfterAppend a non-retriable exception

Posted by Jay Kreps <ja...@gmail.com>.
Yeah, hey Joel, that was a super long-winded way of saying let's leave it
at Retriable. I agree there is another concept which is "non-duplicate-producing"
but when we do the idempotence stuff then all things will have
that property so it may be okay just to leave it for now since network
errors make it impossible to really do this properly without idempotence.

-Jay

On Wed, Feb 11, 2015 at 5:41 AM, Joel Koshy <jj...@gmail.com> wrote:

> Thanks for the comments - however, it is not clear to me what your
> preference is on making NotEnoughReplicasAfterAppend retriable vs
> non-retriable.
>
> As for me, my preference is to leave it as retriable since it is clear
> that the produce may succeed on a retry (and may introduce a
> duplicate). I agree that idempotence will bring full closure to this
> though.
>
> Anyone else have a preference on this?
>
> Thanks,
>
> Joel
>
> On Tue, Feb 10, 2015 at 08:23:08PM -0800, Jay Kreps wrote:
> > Yeah there are really two concepts here as I think you noted:
> > 1. Retry safe: we know that the write did not occur
> > 2. Retry fixable: if you send that again it could work
> >
> > (probably there are better names for these).
> >
> > Some things we know did not do a write and may be fixed by retrying
> > (no leader). Some things we know didn't do a write and are not worth
> > retrying (message too large). Some things we don't know and are worth
> > retrying (network error), and probably some things we don't know and
> > aren't worth it (can't think of one though).
> >
> > (I feel like Donald Rumsfeld with the "known unknowns" thing).
> >
> > In the current world if you set retries > 0 you are saying "I accept
> > duplicates but want to ensure my stuff gets written", if you set
> > retries = 0 you are saying "I can't abide duplicates and am willing
> > to tolerate loss". So Retryable for us means "retry may succeed".
> >
> > Originally I thought of maybe trying to model both concepts. However
> > the two arguments against it are:
> > 1. Even if you do this the guarantee remains "at least once delivery"
> > because: (1) in the network error case you just don't know, (2)
> > consumer failure.
> > 2. The proper fix for this is to add idempotence support on the
> > server, which we should do.
> >
> > Doing idempotence support on the server will actually fix all
> > duplicate problems, including the network error case (because of
> > course the server knows whether your write went through even though
> > the client doesn't). When we have that then the client can always
> > just retry anything marked Retriable (i.e. retry may work) without
> > fear of duplicates.
> >
> > This gives exactly once delivery to the log, and a co-operating
> > consumer can use the offset to dedupe and get it end-to-end.
> >
> > So that was why I had just left one type of Retriable and used it to
> > mean "retry may work" and didn't try to flag anything for duplicates.
> >
> > -Jay
> >
> >
> >
> >
> > On Tue, Feb 10, 2015 at 4:32 PM, Gwen Shapira <gs...@cloudera.com> wrote:
> >
> > > Hi Kafka Devs,
> > >
> > > Need your thoughts on retriable exceptions:
> > >
> > > If a user configures Kafka with min.isr > 1 and there are not enough
> > > replicas to safely store the data, there are two possibilities:
> > >
> > > 1. The lack of replicas was discovered before the message was
> > > written. We throw NotEnoughReplicas.
> > > 2. The lack of replicas was discovered after the message was written
> > > to the leader. In this case, we throw NotEnoughReplicasAfterAppend.
> > >
> > > Currently, both errors are Retriable, which means that the new
> > > producer will retry multiple times.
> > > In the case of the second exception, this will cause duplicates.
> > >
> > > KAFKA-1697 suggests:
> > > "we probably want to make NotEnoughReplicasAfterAppend a
> > > non-retriable exception and let the client decide what to do."
> > >
> > > I agreed that the client (the one using the Producer) should weigh
> > > the problems duplicates will cause against the probability of losing
> > > the message and do something sensible, so I made the exception
> > > non-retriable.
> > >
> > > In the RB (https://reviews.apache.org/r/29647/) Joel raised a good
> > > point:
> > > (Joel, feel free to correct me if I misrepresented your point)
> > >
> > > "I think our interpretation of retriable is as follows (but we can
> > > discuss on the list if that needs to change): if the produce request
> > > hits an error, and there is absolutely no point in retrying then
> > > that is a non-retriable error. MessageSizeTooLarge is an example -
> > > since unless the producer changes the request to make the messages
> > > smaller there is no point in retrying.
> > >
> > > ...
> > > Duplicates can arise even for other errors (e.g., request timed
> > > out). So that side-effect is not compelling enough to warrant a
> > > change to make this non-retriable."
> > >
> > > (TL;DR) Should exceptions where retries can cause duplicates still
> > > be retriable?
> > >
> > > Gwen
> > >
>
>

Re: [DISCUSSION] KAFKA-1697 - make NotEnoughReplicasAfterAppend a non-retriable exception

Posted by Joel Koshy <jj...@gmail.com>.
Thanks for the comments - however, it is not clear to me what your
preference is on making NotEnoughReplicasAfterAppend retriable vs
non-retriable.

As for me, my preference is to leave it as retriable since it is clear
that the produce may succeed on a retry (and may introduce a
duplicate). I agree that idempotence will bring full closure to this
though.

Anyone else have a preference on this?

Thanks,

Joel

On Tue, Feb 10, 2015 at 08:23:08PM -0800, Jay Kreps wrote:
> Yeah there are really two concepts here as I think you noted:
> 1. Retry safe: we know that the write did not occur
> 2. Retry fixable: if you send that again it could work
> 
> (probably there are better names for these).
> 
> Some things we know did not do a write and may be fixed by retrying (no
> leader). Some things we know didn't do a write and are not worth retrying
> (message too large). Some things we don't know and are worth retrying
> (network error), and probably some things we don't know and aren't worth it
> (can't think of one though).
> 
> (I feel like Donald Rumsfeld with the "known unknowns" thing).
> 
> In the current world if you set retries > 0 you are saying "I accept
> duplicates but want to ensure my stuff gets written", if you set retries =
> 0 you are saying "I can't abide duplicates and am willing to tolerate
> loss". So Retryable for us means "retry may succeed".
> 
> Originally I thought of maybe trying to model both concepts. However the
> two arguments against it are:
> 1. Even if you do this the guarantee remains "at least once delivery"
> because: (1) in the network error case you just don't know, (2) consumer
> failure.
> 2. The proper fix for this is to add idempotence support on the server,
> which we should do.
> 
> Doing idempotence support on the server will actually fix all duplicate
> problems, including the network error case (because of course the server
> knows whether your write went through even though the client doesn't). When
> we have that then the client can always just retry anything marked
> Retriable (i.e. retry may work) without fear of duplicates.
> 
> This gives exactly once delivery to the log, and a co-operating consumer
> can use the offset to dedupe and get it end-to-end.
> 
> So that was why I had just left one type of Retriable and used it to mean
> "retry may work" and don't try to flag anything for duplicates.
> 
> -Jay
> 
> 
> 
> 
> On Tue, Feb 10, 2015 at 4:32 PM, Gwen Shapira <gs...@cloudera.com> wrote:
> 
> > Hi Kafka Devs,
> >
> > Need your thoughts on retriable exceptions:
> >
> > If a user configures Kafka with min.isr > 1 and there are not enough
> > replicas to safely store the data, there are two possibilities:
> >
> > 1. The lack of replicas was discovered before the message was written. We
> > throw NotEnoughReplicas.
> > 2. The lack of replicas was discovered after the message was written to
> > the leader. In this case, we throw NotEnoughReplicasAfterAppend.
> >
> > Currently, both errors are Retriable, which means that the new producer
> > will retry multiple times.
> > In the case of the second exception, this will cause duplicates.
> >
> > KAFKA-1697 suggests:
> > "we probably want to make NotEnoughReplicasAfterAppend a non-retriable
> > exception and let the client decide what to do."
> >
> > I agreed that the client (the one using the Producer) should weigh the
> > problems duplicates will cause against the probability of losing the
> > message and do something sensible, so I made the exception non-retriable.
> >
> > In the RB (https://reviews.apache.org/r/29647/) Joel raised a good point:
> > (Joel, feel free to correct me if I misrepresented your point)
> >
> > "I think our interpretation of retriable is as follows (but we can discuss
> > on the list if that needs to change): if the produce request hits an error,
> > and there is absolutely no point in retrying then that is a non-retriable
> > error. MessageSizeTooLarge is an example - since unless the producer
> > changes the request to make the messages smaller there is no point in
> > retrying.
> >
> > ...
> > Duplicates can arise even for other errors (e.g., request timed out). So
> > that side-effect is not compelling enough to warrant a change to make this
> > non-retriable. "
> >
> > (TL;DR) Should exceptions where retries can cause duplicates still be
> > retriable?
> >
> > Gwen
> >


Re: [DISCUSSION] KAFKA-1697 - make NotEnoughReplicasAfterAppend a non-retriable exception

Posted by Jay Kreps <ja...@gmail.com>.
Yeah there are really two concepts here as I think you noted:
1. Retry safe: we know that the write did not occur
2. Retry fixable: if you send that again it could work

(probably there are better names for these).

Some things we know did not do a write and may be fixed by retrying (no
leader). Some things we know didn't do a write and are not worth retrying
(message too large). Some things we don't know and are worth retrying
(network error), and probably some things we don't know and aren't worth it
(can't think of one though).
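
To make the two axes concrete, a sketch with invented names (labels for
the discussion only, not actual Kafka classes):

    // Two orthogonal questions about a failed produce request:
    enum WriteState { KNOWN_NOT_WRITTEN, UNKNOWN } // is a retry safe?
    enum RetryOdds  { MAY_SUCCEED, POINTLESS }     // can a retry fix it?

    // The examples above, classified:
    //   no leader         -> KNOWN_NOT_WRITTEN, MAY_SUCCEED
    //   message too large -> KNOWN_NOT_WRITTEN, POINTLESS
    //   network error     -> UNKNOWN,           MAY_SUCCEED
    //   (UNKNOWN + POINTLESS: no example comes to mind)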

(I feel like Donald Rumsfeld with the "known unknowns" thing).

In the current world if you set retries > 0 you are saying "I accept
duplicates but want to ensure my stuff gets written", if you set retries =
0 you are saying "I can't abide duplicates and am willing to tolerate
loss". So Retryable for us means "retry may succeed".

Originally I thought of maybe trying to model both concepts. However the
two arguments against it are:
1. Even if you do this the guarantee remains "at least once delivery"
because: (1) in the network error case you just don't know, (2) consumer
failure.
2. The proper fix for this is to add idempotence support on the server,
which we should do.

Doing idempotence support on the server will actually fix all duplicate
problems, including the network error case (because of course the server
knows whether your write went through even though the client doesn't). When
we have that then the client can always just retry anything marked
Retriable (i.e. retry may work) without fear of duplicates.
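
Roughly the decision the client makes today, simplified from memory (a
sketch, not the actual Sender code):

    // Retry only errors marked Retriable, within the configured budget.
    static boolean canRetry(Exception error, int attempts, int retries) {
        return attempts < retries
            && error instanceof org.apache.kafka.common.errors.RetriableException;
    }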

This gives exactly once delivery to the log, and a co-operating consumer
can use the offset to dedupe and get it end-to-end.
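
For example, a sketch of the consumer side (loadLastOffset, process and
saveLastOffset stand in for hypothetical application code):

    // Offsets are totally ordered within a partition, so anything at or
    // below the last processed offset must be a duplicate from a retry.
    long last = loadLastOffset(partition);
    for (MessageAndOffset m : messages) {
        if (m.offset() <= last)
            continue;                 // duplicate: skip it
        process(m.message());         // hypothetical application logic
        last = m.offset();
    }
    saveLastOffset(partition, last);  // persist along with the results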

So that was why I had just left one type of Retriable and used it to mean
"retry may work" and don't try to flag anything for duplicates.

-Jay




On Tue, Feb 10, 2015 at 4:32 PM, Gwen Shapira <gs...@cloudera.com> wrote:

> Hi Kafka Devs,
>
> Need your thoughts on retriable exceptions:
>
> If a user configures Kafka with min.isr > 1 and there are not enough
> replicas to safely store the data, there are two possibilities:
>
> 1. The lack of replicas was discovered before the message was written. We
> throw NotEnoughReplicas.
> 2. The lack of replicas was discovered after the message was written to
> the leader. In this case, we throw NotEnoughReplicasAfterAppend.
>
> Currently, both errors are Retriable, which means that the new producer
> will retry multiple times.
> In the case of the second exception, this will cause duplicates.
>
> KAFKA-1697 suggests:
> "we probably want to make NotEnoughReplicasAfterAppend a non-retriable
> exception and let the client decide what to do."
>
> I agreed that the client (the one using the Producer) should weigh the
> problems duplicates will cause against the probability of losing the
> message and do something sensible, so I made the exception non-retriable.
>
> In the RB (https://reviews.apache.org/r/29647/) Joel raised a good point:
> (Joel, feel free to correct me if I misrepresented your point)
>
> "I think our interpretation of retriable is as follows (but we can discuss
> on the list if that needs to change): if the produce request hits an error,
> and there is absolutely no point in retrying then that is a non-retriable
> error. MessageSizeTooLarge is an example - since unless the producer
> changes the request to make the messages smaller there is no point in
> retrying.
>
> ...
> Duplicates can arise even for other errors (e.g., request timed out). So
> that side-effect is not compelling enough to warrant a change to make this
> non-retriable. "
>
> (TL;DR) Should exceptions where retries can cause duplicates still be
> retriable?
>
> Gwen
>