Posted to users@kafka.apache.org by adrien ruffie <ad...@hotmail.fr> on 2018/03/01 10:59:28 UTC

RE: difference between 2 options

Sorry Andras, for the delay in my response.


OK, I now understand the deletion part correctly, thanks to your explanation.


However, for the recovery point, I wanted to ask you about the logic behind the concept.


For example, I have one recovery-point-offset-checkpoint file in topic-0.


If the broker crashes and restarts:


the presence of a recovery-point-offset-checkpoint file avoids recovering the whole log during startup.

But what does that mean exactly? Is only one offset number present in this recovery file?

If that is the case, will the broker simply load into memory all messages in this log from this offset onward?


I really want to correctly understand the concept 😊


Best regards,


Adrien

________________________________
From: Andras Beni <an...@cloudera.com>
Sent: Tuesday, February 27, 2018 15:41:04
To: users@kafka.apache.org
Subject: Re: difference between 2 options

1) We write out one recovery point per log directory, which practically
means topicpartition. So if your topic is called mytopic, then you will
have a file called recovery-point-offset-checkpoint in mytopic-0/,
mytopic-1/, and mytopic-2/.

2) Data deletion in Kafka is not related to what was read by consumers.
Data is deleted when there is either too much of it (log.retention.bytes
property) or it is too old (log.retention.ms property). Consumers keep
track of what they have consumed using the __consumer_offsets topic (or
some custom logic they choose).
What we are talking about here is DeleteRecordsRequest. It is sent by a
command line tool called kafka.admin.DeleteRecordsCommand. This does not
actually delete any data but notes that the data before a given offset
should not be served anymore. This, just like recovery checkpointing, works
on a per-partition basis.
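
To make the "noted but not deleted" idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not Kafka code: the PartitionLog class, its fields, and its method names are all hypothetical. It only illustrates that moving a per-partition log start offset forward hides earlier records from readers without touching the stored data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PartitionLog:
    """Toy stand-in for one partition's log (hypothetical, not Kafka's class)."""
    records: List[str] = field(default_factory=list)  # list index == offset
    log_start_offset: int = 0  # lowest offset still served to readers

    def delete_records_before(self, offset: int) -> None:
        # Only move the marker forward; the stored records are left untouched.
        bounded = min(offset, len(self.records))
        self.log_start_offset = max(self.log_start_offset, bounded)

    def fetch(self, offset: int) -> List[str]:
        # Offsets below the marker are no longer served.
        if offset < self.log_start_offset:
            raise ValueError(
                f"offset {offset} is below log start offset {self.log_start_offset}"
            )
        return self.records[offset:]


log = PartitionLog(records=[f"msg-{i}" for i in range(10)])
log.delete_records_before(4)
visible = log.fetch(log.log_start_offset)
```

In this model the marker is the only state that changes, which is why it has to be written to disk periodically: if it were lost in a crash, the records before it would become readable again.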

Does this answer your questions?

Best regards,
Andras


On Mon, Feb 26, 2018 at 11:43 PM, adrien ruffie <ad...@hotmail.fr>
wrote:

> Hi Andras,
>
>
> Thanks for your response!
>
> For log.flush.offset.checkpoint.interval.ms, do we write out only one
> recovery point for all logs?
>
> If I have 3 partitions and the offset is different for each partition,
> what happens? Do we save 3 different offsets in the text file, or just
> one for all three partitions?
>
>
> When you say "to avoid exposing data that have been deleted by
> DeleteRecordsRequest",
>
> does that mean the old, already consumed data? For example, if I am at
> offset 34700, is it to avoid re-exposing
>
> records 34000~34699 to the consumer after a crash?
>
> ________________________________
> From: Andras Beni <an...@cloudera.com>
> Sent: Tuesday, February 27, 2018 06:16:41
> To: users@kafka.apache.org
> Subject: Re: difference between 2 options
>
> Hi Adrien,
>
> Every log.flush.offset.checkpoint.interval.ms milliseconds, we write out the
> current recovery point for all logs to a text file in the log directory to
> avoid recovering the whole log on startup.
>
> And every log.flush.start.offset.checkpoint.interval.ms milliseconds, we write
> out the current log start offset for all logs to a text file in the log
> directory to avoid exposing data that have been deleted by DeleteRecordsRequest.
>
> HTH,
> Andras
>
>
> On Mon, Feb 26, 2018 at 1:51 PM, adrien ruffie <ad...@hotmail.fr>
> wrote:
>
> > Hello all,
> >
> >
> > I have read the linked properties documentation, but I don't really
> > understand the difference between:
> >
> > log.flush.offset.checkpoint.interval.ms
> >
> >
> > and
> >
> >
> > log.flush.start.offset.checkpoint.interval.ms
> >
> >
> > Do you have a use case for each property? I can't figure out
> > what the difference is.
> >
> >
> > best regards,
> >
> >
> > Adrien
> >
>

RE: difference between 2 options

Posted by adrien ruffie <ad...@hotmail.fr>.
Perfect, Andras! Thanks a lot.

I have noted all of your explanations 😊.


best regards,

Adrien

________________________________
From: Andras Beni <an...@cloudera.com>
Sent: Saturday, March 3, 2018 09:29:16
To: users@kafka.apache.org
Subject: Re: difference between 2 options

Hello Adrien,

I was wrong. There is only one such file per data dir and not one per
topicpartition dir. It is a text file containing
 - a format version number (0),
 - number of following entries, and
 - one entry for each topicpartition: topic name, partition and offset.
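
As a rough illustration of that layout, the following Python sketch parses such a file. The function name parse_checkpoint and the sample contents are hypothetical, and the sketch assumes each entry sits on its own line with whitespace-separated fields; treat that separator as an assumption and check it against a real file from your own data dir.

```python
from typing import Dict, Tuple


def parse_checkpoint(text: str) -> Dict[Tuple[str, int], int]:
    """Parse checkpoint text into {(topic, partition): offset}."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    version = int(lines[0])  # first line: format version number
    if version != 0:
        raise ValueError(f"unsupported checkpoint version: {version}")
    expected = int(lines[1])  # second line: number of following entries
    entries: Dict[Tuple[str, int], int] = {}
    for line in lines[2:]:
        # Assumed entry layout: "<topic> <partition> <offset>"
        topic, partition, offset = line.split()
        entries[(topic, int(partition))] = int(offset)
    if len(entries) != expected:
        raise ValueError(f"expected {expected} entries, found {len(entries)}")
    return entries


# Made-up sample: one topic with three partitions, each with its own offset.
sample = """0
3
mytopic 0 34700
mytopic 1 1200
mytopic 2 0
"""
checkpoints = parse_checkpoint(sample)
```

So a single file can hold a different offset for every topicpartition stored in that data dir.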

Yes, when the broker starts, it checks these entries. As you probably know,
one topicpartition is written to multiple log segments. If the broker finds
that there are messages after the recovery point, each log segment that
contains such messages will be iterated over and the messages will be
checked and a new index will be built.
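
The segment selection described above can be sketched roughly as follows. This is a simplified Python model and not the broker's implementation: segments_to_recover is a hypothetical helper, each segment is represented only by its base offset, and the log_end_offset parameter is an assumption introduced to express "there are messages after the recovery point".

```python
from bisect import bisect_right
from typing import List


def segments_to_recover(
    base_offsets: List[int], recovery_point: int, log_end_offset: int
) -> List[int]:
    """Return base offsets of segments that may hold messages at or after
    recovery_point (simplified model)."""
    if recovery_point >= log_end_offset:
        return []  # nothing was written after the recovery point
    ordered = sorted(base_offsets)
    # The segment containing recovery_point is the last one whose base
    # offset is <= recovery_point; it and every later segment are rescanned.
    first = max(bisect_right(ordered, recovery_point) - 1, 0)
    return ordered[first:]


# Made-up example: four segments, recovery point inside the third one.
segments = [0, 1000, 2000, 3000]
to_check = segments_to_recover(segments, recovery_point=2500, log_end_offset=3400)
```

In this example only the segments starting at 2000 and 3000 would be iterated over; the two older segments are skipped, which is the saving the checkpoint provides.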

I hope this answers your questions.

Best regards,
Andras




Re: difference between 2 options

Posted by Andras Beni <an...@cloudera.com>.
Hello Adrien,

I was wrong. There is only one such file per data dir and not one per
topicpartition dir. It is a text file containing
 - a format version number (0),
 - number of following entries, and
 - one entry for each topicpartition: topic name, partition and offset.

Yes, when the broker starts, it checks these entries. As you probably know,
one topicpartition is written to multiple log segments. If the broker finds
that there are messages after the recovery point, each log segment that
contains such messages will be iterated over and the messages will be
checked and a new index will be built.

I hope this answers your questions.

Best regards,
Andras


