Posted to dev@kudu.apache.org by Dinesh Bhat <di...@cloudera.com> on 2016/08/18 14:05:14 UTC

Few Qs on log retention

Hi Todd,

While looking at KUDU-1408, which addresses the catch-up game between bootstrapping a new replica and log retention on existing replicas, a few basic Qs popped up in my mind:
Looking at the description of the LogAnchor class, what is the purpose of a log anchor in general?
Also, is this any different from rejoining a stale replica to the cluster? (stale could mean going as far back as starting a new replica?)
Instead of sending anchor logs via heartbeats (as you seem to be suggesting in the bug), could we not rely on the follower to suggest to the leader where it wants to replay the logs from? In other words, the 'tablet copy' begins with the leader asking the follower for its last known index. I am confident I am missing several pieces here.
Thanks,
Dinesh.

Re: Few Qs on log retention

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Aug 18, 2016 at 2:40 PM, Dinesh Bhat <di...@cloudera.com> wrote:

> Thanks a bunch Todd for these detailed explanations - some of those
> unclear Qs stemmed from the fact that I didn’t know the difference between
> resurrecting a stale replica vs bringing up a new replica and their
> relation to WALs and GCs. Your descriptions with examples provide helpful
> context for me to spend some time on code reading.
>
No problem, glad to help.


> Also, I wonder if I could Ctrl+CV some of the below points to an existing
> design doc.
>
>
Sure, it would be great to include more info in a design doc, or if you find
places to insert this info into class headers, that's great too. (It's often
easier to run into docs in the code than to find them in the external
design-docs dir.)


> Thanks,
>
> > On Aug 18, 2016, at 11:51 AM, Todd Lipcon <to...@cloudera.com> wrote:
> >
> > On Thu, Aug 18, 2016 at 7:05 AM, Dinesh Bhat <di...@cloudera.com> wrote:
> >
> >> Hi Todd,
> >>
> >> While looking at KUDU-1408, which addresses the catch-up game between
> >> bootstrapping a new replica and log retention on existing replicas, a
> >> few basic Qs popped up in my mind:
> >> Looking at the description of the LogAnchor class, what is the purpose
> >> of a log anchor in general?
> >>
> >
> > We use the term 'anchor' to mean preventing the garbage collection of old
> > log segments. The main reasons we anchor logs are currently:
> >
> > 1- Durability: if there is some data that was inserted into a memory store
> > (MRS or DMS) which has not yet been flushed, we need to retain the log
> > segment which has the original operation so that we can replay it
> >
> > 2- Time: we respect a configuration --log_min_seconds_to_retain and always
> > retain any log segment that was closed within that time bound. This is
> > useful so that, if a follower replica crashes or freezes temporarily and
> > then comes back within this time period, the leader will still have enough
> > back history of logs in order to "catch up" the follower.
> >
> > 3- Number of segments: a configuration --log_min_segments_to_retain
> > determines how many segments of WAL we keep regardless of the above
> > concerns. This is useful for things like debugging, and to prevent some
> > edge cases in the code that might happen if there are 0 WALs for a tablet.
> >
> > Log Anchors are primarily concerned with item #1 above (durability). They
> > allow a memory store to register an "anchor" corresponding to the operation
> > index of the first data item that mutated them, and that anchor is kept
> > alive until the store is durably flushed. The registry keeps the set of
> > live anchors so that it can quickly be interrogated during log GC.
> >
> >
> >
> >> Also, is this any different from rejoining a stale replica to the
> >> cluster? (stale could mean going as far back as starting a new replica?)
> >>
> >
> > Not sure exactly what you're asking here... maybe the above answers it? The
> > time-based retention's main goal is to keep some "back history" of each
> > tablet so that if a replica is only a little bit stale, it can restart and
> > catch up from the leader's logs. If the leader has already GCed the
> > necessary logs to catch up the replica (eg the replica was down for too
> > long) then it will have to be evicted and replaced as a new replica. In
> > that case, the new replica will start its life using the 'tablet copy'
> > mechanism to transfer a snapshot of one of the existing replicas. The
> > snapshot contains on-disk state as well as the WALs necessary to replay
> > from that state. You can think of it like an rsync from a good replica.
> >
> >
> >> Instead of sending anchor logs via heartbeats (as you seem to be
> >> suggesting in the bug), could we not rely on the follower to suggest to
> >> the leader where it wants to replay the logs from? In other words, the
> >> 'tablet copy' begins with the leader asking the follower for its last
> >> known index. I am confident I am missing several pieces here.
> >>
> >
> > The thing is that replication is leader-driven, and you may have a leader
> > re-election at any point. So, if you have three nodes A, B, and C, with A
> > as the leader, and C as a laggy replica, we want to ensure that neither A
> > *nor B* GCs the logs that C may need to catch up. Currently there is no
> > propagation of the information about how far behind C is back up to the WAL
> > system, and both A and B will happily GC logs based on the factors
> > mentioned up top in this email. One of the suggestions in KUDU-1408 is to
> > (1) propagate info on the lagging replica's index to the leader to prevent
> > GC on the leader, and (2) to have the leader propagate that information to
> > other followers (like 'B' in this example) so that they also don't GC that
> > log segment too early.
> >
> > Hope that helps
> > -Todd
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Few Qs on log retention

Posted by Dinesh Bhat <di...@cloudera.com>.
Thanks a bunch Todd for these detailed explanations - some of those unclear Qs stemmed from the fact that I didn’t know the difference between resurrecting a stale replica vs bringing up a new replica and their relation to WALs and GCs. Your descriptions with examples provide helpful context for me to spend some time on code reading.

Also, I wonder if I could Ctrl+CV some of the below points to an existing design doc.

Thanks,

> On Aug 18, 2016, at 11:51 AM, Todd Lipcon <to...@cloudera.com> wrote:
> 
> On Thu, Aug 18, 2016 at 7:05 AM, Dinesh Bhat <di...@cloudera.com> wrote:
> 
>> Hi Todd,
>> 
>> While looking at KUDU-1408, which addresses the catch-up game between
>> bootstrapping a new replica and log retention on existing replicas, a
>> few basic Qs popped up in my mind:
>> Looking at the description of the LogAnchor class, what is the purpose
>> of a log anchor in general?
>> 
> 
> We use the term 'anchor' to mean preventing the garbage collection of old
> log segments. The main reasons we anchor logs are currently:
> 
> 1- Durability: if there is some data that was inserted into a memory store
> (MRS or DMS) which has not yet been flushed, we need to retain the log
> segment which has the original operation so that we can replay it
> 
> 2- Time: we respect a configuration --log_min_seconds_to_retain and always
> retain any log segment that was closed within that time bound. This is
> useful so that, if a follower replica crashes or freezes temporarily and
> then comes back within this time period, the leader will still have enough
> back history of logs in order to "catch up" the follower.
> 
> 3- Number of segments: a configuration --log_min_segments_to_retain
> determines how many segments of WAL we keep regardless of the above
> concerns. This is useful for things like debugging, and to prevent some
> edge cases in the code that might happen if there are 0 WALs for a tablet.
> 
> Log Anchors are primarily concerned with item #1 above (durability). They
> allow a memory store to register an "anchor" corresponding to the operation
> index of the first data item that mutated them, and that anchor is kept
> alive until the store is durably flushed. The registry keeps the set of
> live anchors so that it can quickly be interrogated during log GC.
> 
> 
> 
>> Also, is this any different from rejoining a stale replica to the
>> cluster? (stale could mean going as far back as starting a new replica?)
>> 
> 
> Not sure exactly what you're asking here... maybe the above answers it? The
> time-based retention's main goal is to keep some "back history" of each
> tablet so that if a replica is only a little bit stale, it can restart and
> catch up from the leader's logs. If the leader has already GCed the
> necessary logs to catch up the replica (eg the replica was down for too
> long) then it will have to be evicted and replaced as a new replica. In
> that case, the new replica will start its life using the 'tablet copy'
> mechanism to transfer a snapshot of one of the existing replicas. The
> snapshot contains on-disk state as well as the WALs necessary to replay
> from that state. You can think of it like an rsync from a good replica.
> 
> 
>> Instead of sending anchor logs via heartbeats (as you seem to be
>> suggesting in the bug), could we not rely on the follower to suggest to
>> the leader where it wants to replay the logs from? In other words, the
>> 'tablet copy' begins with the leader asking the follower for its last
>> known index. I am confident I am missing several pieces here.
>> 
> 
> The thing is that replication is leader-driven, and you may have a leader
> re-election at any point. So, if you have three nodes A, B, and C, with A
> as the leader, and C as a laggy replica, we want to ensure that neither A
> *nor B* GCs the logs that C may need to catch up. Currently there is no
> propagation of the information about how far behind C is back up to the WAL
> system, and both A and B will happily GC logs based on the factors
> mentioned up top in this email. One of the suggestions in KUDU-1408 is to
> (1) propagate info on the lagging replica's index to the leader to prevent
> GC on the leader, and (2) to have the leader propagate that information to
> other followers (like 'B' in this example) so that they also don't GC that
> log segment too early.
> 
> Hope that helps
> -Todd
> -- 
> Todd Lipcon
> Software Engineer, Cloudera


Re: Few Qs on log retention

Posted by Todd Lipcon <to...@cloudera.com>.
On Thu, Aug 18, 2016 at 7:05 AM, Dinesh Bhat <di...@cloudera.com> wrote:

> Hi Todd,
>
> While looking at KUDU-1408, which addresses the catch-up game between
> bootstrapping a new replica and log retention on existing replicas, a
> few basic Qs popped up in my mind:
> Looking at the description of the LogAnchor class, what is the purpose
> of a log anchor in general?
>

We use the term 'anchor' to mean preventing the garbage collection of old
log segments. The main reasons we anchor logs are currently:

1- Durability: if there is some data that was inserted into a memory store
(MRS or DMS) which has not yet been flushed, we need to retain the log
segment which has the original operation so that we can replay it

2- Time: we respect a configuration --log_min_seconds_to_retain and always
retain any log segment that was closed within that time bound. This is
useful so that, if a follower replica crashes or freezes temporarily and
then comes back within this time period, the leader will still have enough
back history of logs in order to "catch up" the follower.

3- Number of segments: a configuration --log_min_segments_to_retain
determines how many segments of WAL we keep regardless of the above
concerns. This is useful for things like debugging, and to prevent some
edge cases in the code that might happen if there are 0 WALs for a tablet.
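
To make the interplay of these three rules concrete, here is a minimal
sketch of a segment-by-segment retention check. The struct and parameter
names are made up for illustration; this is not the actual Kudu log GC code:

#include <cstddef>
#include <cstdint>
#include <vector>

struct SegmentInfo {
  int64_t close_time_secs;  // wall-clock time when the segment was closed
  int64_t max_op_index;     // highest operation index stored in the segment
};

// 'segments' is ordered oldest-first. Returns how many of the oldest
// segments could be garbage collected under the three rules above.
int CountGCableSegments(const std::vector<SegmentInfo>& segments,
                        int64_t now_secs,
                        int64_t min_anchored_op_index,  // rule 1 (durability)
                        int64_t min_seconds_to_retain,  // rule 2 (--log_min_seconds_to_retain)
                        int min_segments_to_retain) {   // rule 3 (--log_min_segments_to_retain)
  int gcable = 0;
  for (std::size_t i = 0; i < segments.size(); ++i) {
    // Rule 3: never drop below the minimum number of retained segments.
    if (static_cast<int>(segments.size() - i) <= min_segments_to_retain) break;
    // Rule 2: keep any segment that was closed within the retention window.
    if (now_secs - segments[i].close_time_secs < min_seconds_to_retain) break;
    // Rule 1: keep any segment that still holds an op anchored by an
    // unflushed memory store.
    if (segments[i].max_op_index >= min_anchored_op_index) break;
    ++gcable;
  }
  return gcable;
}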

Log Anchors are primarily concerned with item #1 above (durability). They
allow a memory store to register an "anchor" corresponding to the operation
index of the first data item that mutated them, and that anchor is kept
alive until the store is durably flushed. The registry keeps the set of
live anchors so that it can quickly be interrogated during log GC.
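
As a toy model of that anchor lifecycle (register at the first mutation's op
index, unregister once the store is flushed, query the minimum during GC),
something like the sketch below captures the idea. The class and method names
here are hypothetical, not Kudu's real LogAnchorRegistry API:

#include <cstdint>
#include <limits>
#include <mutex>
#include <set>

// Toy anchor registry: each unflushed memory store (MRS/DMS) holds one
// anchor at the op index of the first mutation applied to it.
class AnchorRegistry {
 public:
  void Register(int64_t op_index) {
    std::lock_guard<std::mutex> l(mu_);
    anchors_.insert(op_index);
  }

  // Called once the store has been durably flushed.
  void Unregister(int64_t op_index) {
    std::lock_guard<std::mutex> l(mu_);
    auto it = anchors_.find(op_index);
    if (it != anchors_.end()) anchors_.erase(it);
  }

  // Interrogated during log GC: no segment containing an op at or after this
  // index may be deleted. Returns INT64_MAX when nothing is anchored.
  int64_t EarliestAnchoredIndex() const {
    std::lock_guard<std::mutex> l(mu_);
    return anchors_.empty() ? std::numeric_limits<int64_t>::max()
                            : *anchors_.begin();
  }

 private:
  mutable std::mutex mu_;
  std::multiset<int64_t> anchors_;  // several stores may anchor the same index
};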



> Also, is this any different from rejoining a stale replica to the
> cluster? (stale could mean going as far back as starting a new replica?)
>

Not sure exactly what you're asking here... maybe the above answers it? The
time-based retention's main goal is to keep some "back history" of each
tablet so that if a replica is only a little bit stale, it can restart and
catch up from the leader's logs. If the leader has already GCed the
necessary logs to catch up the replica (eg the replica was down for too
long) then it will have to be evicted and replaced as a new replica. In
that case, the new replica will start its life using the 'tablet copy'
mechanism to transfer a snapshot of one of the existing replicas. The
snapshot contains on-disk state as well as the WALs necessary to replay
from that state. You can think of it like an rsync from a good replica.
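
If it helps, the choice described above can be pictured roughly like this
(a purely illustrative helper, not an actual Kudu function):

#include <cstdint>

enum class CatchUpAction { kReplayFromLeaderWal, kEvictAndTabletCopy };

// The follower knows the last op index it received; the leader knows the
// earliest index still retained in its own WAL.
CatchUpAction ChooseCatchUpAction(int64_t follower_last_received_index,
                                  int64_t leader_earliest_retained_index) {
  // If the leader still has every op the follower is missing, the follower
  // can simply catch up by replaying from the leader's logs.
  if (follower_last_received_index + 1 >= leader_earliest_retained_index) {
    return CatchUpAction::kReplayFromLeaderWal;
  }
  // Otherwise the needed segments were already GCed: the stale replica gets
  // evicted and re-created via 'tablet copy' (snapshot of data plus WALs).
  return CatchUpAction::kEvictAndTabletCopy;
}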


> Instead of sending anchor logs via heartbeats (as you seem to be
> suggesting in the bug), could we not rely on the follower to suggest to
> the leader where it wants to replay the logs from? In other words, the
> 'tablet copy' begins with the leader asking the follower for its last
> known index. I am confident I am missing several pieces here.
>

The thing is that replication is leader-driven, and you may have a leader
re-election at any point. So, if you have three nodes A, B, and C, with A
as the leader, and C as a laggy replica, we want to ensure that neither A
*nor B* GCs the logs that C may need to catch up. Currently there is no
propagation of the information about how far behind C is back up to the WAL
system, and both A and B will happily GC logs based on the factors
mentioned up top in this email. One of the suggestions in KUDU-1408 is to
(1) propagate info on the lagging replica's index to the leader to prevent
GC on the leader, and (2) to have the leader propagate that information to
other followers (like 'B' in this example) so that they also don't GC that
log segment too early.
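
As a rough sketch of that two-part suggestion, with hypothetical names rather
than the actual KUDU-1408 implementation:

#include <algorithm>
#include <cstdint>
#include <limits>
#include <map>
#include <string>

// (1) On the leader: track how far each follower has gotten, and derive the
// lowest op index that must survive GC so the laggiest follower can catch up.
struct LeaderRetentionTracker {
  std::map<std::string, int64_t> last_received;  // follower uuid -> last replicated index

  void OnFollowerResponse(const std::string& uuid, int64_t index) {
    last_received[uuid] = index;
  }

  int64_t MinIndexToRetain() const {
    int64_t min_needed = std::numeric_limits<int64_t>::max();
    for (const auto& entry : last_received) {
      // The first index a follower still needs is one past its last one.
      min_needed = std::min<int64_t>(min_needed, entry.second + 1);
    }
    return min_needed;
  }
};

// (2) The leader would then piggyback MinIndexToRetain() on the update /
// heartbeat requests it already sends, so that followers like 'B' also hold
// back GC of the segments that 'C' still needs.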

Hope that helps
-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera