Posted to users@kafka.apache.org by Xiaoyu Wang <xw...@rocketfuel.com> on 2014/12/14 16:20:22 UTC

metrics about how behind a replica is?

Hello,

If I understand it correctly, when the number of messages a replica is
behind the leader is less than replica.lag.max.messages, the replica is
considered in sync with the leader and is eligible for leader election.

Does this mean we can lose up to replica.lag.max.messages messages during
a leader election? We could set replica.lag.max.messages very low, but then
replicas would fall out of the ISR more often, making unclean leader
election more likely, so we can still lose data.

Can you recommend a way to prevent data loss? We have tried requiring
acks from all replicas, but that slows down the producer significantly.

In addition, do we have metrics showing how far each replica is behind?
If not, can we add them?


Thanks,

Re: metrics about how behind a replica is?

Posted by Jun Rao <ju...@confluent.io>.
In 0.8.2, we have a new Java producer that allows you to specify a callback
for each message sent.
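A minimal sketch of the shape of that callback API: the real interface is
org.apache.kafka.clients.producer.Callback, which receives a RecordMetadata
and an Exception on completion. The Callback and FakeProducer below are
simplified stdlib-only stand-ins to illustrate the pattern, not the real
client:

```java
import java.util.ArrayList;
import java.util.List;

public class CallbackSketch {
    // Simplified stand-in for the producer callback interface; the real
    // 0.8.2 one receives (RecordMetadata, Exception).
    interface Callback {
        void onCompletion(Long offset, Exception exception);
    }

    // Hypothetical in-memory "producer" that reports per-message results.
    static class FakeProducer {
        private long nextOffset = 0;

        void send(String message, Callback cb) {
            if (message == null) {  // simulate a failed send
                cb.onCompletion(null, new IllegalArgumentException("null message"));
            } else {
                cb.onCompletion(nextOffset++, null);
            }
        }
    }

    public static void main(String[] args) {
        FakeProducer producer = new FakeProducer();
        List<String> failed = new ArrayList<>();

        // Per-message callback: record any send that came back with an error.
        Callback track = (offset, e) -> {
            if (e != null) failed.add("<failed send>");
        };

        producer.send("m1", track);
        producer.send(null, track);   // this one fails
        producer.send("m2", track);

        System.out.println("failed sends: " + failed.size());
    }
}
```

The point is that the application learns the fate of each individual
message, instead of only seeing queue-level failures.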

Thanks,

Jun

On Thu, Dec 18, 2014 at 12:07 PM, Xiaoyu Wang <xw...@rocketfuel.com> wrote:


Re: metrics about how behind a replica is?

Posted by Xiaoyu Wang <xw...@rocketfuel.com>.
@Jun, We can increase the number of resends, but the produce request may
still fail.

For the async producer, at the time it fails, we have:

   - Messages that are in the queue but have not been sent. From the Java
     API, we don't know which messages are still in the queue.
      - Is it possible to expose the blocking queue size so we know what
        remains in the queue?
   - Messages that failed all their retries. For the last batch, some may
     have succeeded while others failed. From the Java API, we don't know
     which messages failed.
      - Is it possible to dump the failed messages to a file so that the
        next run can pick them up?

Does this make sense? Is there another way you would recommend for keeping
track of which messages have been sent by the async producer?
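The "dump failed messages to a file" idea can be sketched as follows.
FailedMessageSpool is a hypothetical stdlib-only helper, not an existing
Kafka API; the hook into the real producer's failure path is noted in a
comment:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class FailedMessageSpool {
    private final Path spool;

    FailedMessageSpool(Path spool) { this.spool = spool; }

    // Would be called from a send callback once a message has exhausted
    // its retries: append the raw message to the spool file.
    void recordFailure(String message) throws IOException {
        Files.write(spool, (message + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Called at startup by the next run: read back what failed last time
    // and clear the spool so messages are replayed at most once.
    List<String> drain() throws IOException {
        if (!Files.exists(spool)) return List.of();
        List<String> lines = Files.readAllLines(spool, StandardCharsets.UTF_8);
        Files.delete(spool);
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("failed-", ".log");
        Files.delete(p);                       // start with no spool file
        FailedMessageSpool s = new FailedMessageSpool(p);
        s.recordFailure("msg-17");
        s.recordFailure("msg-42");
        System.out.println(s.drain());         // next run replays these
    }
}
```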

Thanks



On Wed, Dec 17, 2014 at 10:58 AM, Jun Rao <ju...@confluent.io> wrote:

Re: metrics about how behind a replica is?

Posted by Jun Rao <ju...@confluent.io>.
You can configure the number of resends on the producer.

Thanks,

Jun

On Wed, Dec 17, 2014 at 10:34 AM, Xiaoyu Wang <xw...@rocketfuel.com> wrote:

Re: metrics about how behind a replica is?

Posted by Xiaoyu Wang <xw...@rocketfuel.com>.
I have tested the "async" producer with "required.ack=-1" and got really
good performance.

We have not used the async producer much before. Is there any potential
data loss when a broker goes down? For example, when a broker goes down,
does the producer resend all the messages in a batch?


On Wed, Dec 17, 2014 at 1:16 PM, Xiaoyu Wang <xw...@rocketfuel.com> wrote:

Re: metrics about how behind a replica is?

Posted by Xiaoyu Wang <xw...@rocketfuel.com>.
Thanks Jun.

We have tested our producer with the different required.ack configs. Even
with required.ack=1, the producer is more than 10 times slower than with
required.ack=0. Does this match your testing?
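For reference, the ack levels being compared map to the 0.8 producer
config roughly as below (property name taken from the 0.8-era producer
docs; double-check against your client version):

```
# 0.8.x (scala) producer config sketch
request.required.acks=0    # no ack: fastest, data lost silently on broker failure
request.required.acks=1    # leader ack only: lost if the leader dies before followers copy
request.required.acks=-1   # ack after all ISR replicas have the message: slowest, safest
```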

I saw the presentation by LinkedIn's Kafka SRE team. I am wondering what
configuration you use at LinkedIn to guarantee zero data loss.

Thanks again, I really appreciate your help!

On Tue, Dec 16, 2014 at 9:50 PM, Jun Rao <ju...@confluent.io> wrote:

Re: metrics about how behind a replica is?

Posted by Jun Rao <ju...@confluent.io>.
replica.lag.max.messages only controls when a replica should be dropped out
of the in-sync replica set (ISR). For a message to be considered committed,
it has to be added to every replica in ISR. When the producer uses ack=-1,
the broker waits until the produced message is committed before
acknowledging the client. So in the case of a clean leader election (i.e.,
there is at least one remaining replica in ISR), no committed messages are
lost. In the case of an unclean leader election, the number of messages
that can be lost depends on the state of the replicas and it's possible to
lose more than replica.lag.max.messages messages.

We do have a lag JMX metric per replica (see
http://kafka.apache.org/documentation.html#monitoring).
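Polling such a metric over JMX can be sketched with stdlib
javax.management alone. The ObjectName a real broker exposes depends on
the Kafka version, so the name below is only a placeholder; here a dummy
MBean is registered in-process just to show the plumbing:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class LagJmxSketch {
    // Minimal standard MBean standing in for the broker's lag metric.
    // JMX convention: class Lag exposes the interface named LagMBean.
    public interface LagMBean { long getValue(); }
    public static class Lag implements LagMBean {
        public long getValue() { return 42; }  // pretend the replica is 42 messages behind
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // On a real broker you would connect to its remote MBean server and
        // query the broker's own lag MBean; the exact ObjectName varies by
        // Kafka version, so this one is purely illustrative.
        ObjectName name = new ObjectName("kafka.sketch:type=ReplicaLag,name=demo");
        server.registerMBean(new Lag(), name);

        long lag = (Long) server.getAttribute(name, "Value");
        System.out.println("replica lag = " + lag);
    }
}
```

An alerting job could poll this attribute periodically and page when the
lag stays above a threshold.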

Thanks,

Jun

On Sun, Dec 14, 2014 at 7:20 AM, Xiaoyu Wang <xw...@rocketfuel.com> wrote: