Posted to dev@kafka.apache.org by Ismael Juma <is...@juma.me.uk> on 2016/09/01 11:55:24 UTC

Re: [VOTE] KIP-33 - Add a time based log index

Sounds good to me too.

For completeness, one thing that is mentioned in the JIRA but not in the
message below is: "If the first message in the segment does not have a
timestamp, we will fall back to use the wall clock time for log rolling".

Ismael

On Wed, Aug 31, 2016 at 7:59 PM, Guozhang Wang <wa...@gmail.com> wrote:

> Some of the streams integration tests also encounter this issue: the
> timestamps we use in the tests are very small (e.g. 1, 2, 3...), which
> causes the log to roll upon each append, and the old segment gets deleted
> very soon. Arguably this can be resolved by enforcing the LogAppendTime
> configuration on the embedded server.
>
> +1 on the proposed change, makes sense to me.
>
> Guozhang
>
>
> On Tue, Aug 30, 2016 at 4:33 PM, Becket Qin <be...@gmail.com> wrote:
>
> > Hi folks,
> >
> > Here is another update on the change of time based log rolling.
> >
> > After the latest implementation, we encountered KAFKA-4099. The issue is
> > that if users move replicas, then for the messages in the old segments,
> > the new replica will create one log segment for each message. The root
> > cause is that we are comparing the wall clock time with the message
> > timestamp. A solution is also described in KAFKA-4099, which is to base
> > the log rolling purely on the timestamps in the messages. More
> > specifically, we roll the log segment if the timestamp of the current
> > message is greater than the timestamp of the first message in the segment
> > by more than log.roll.ms. This approach is wall-clock independent and
> > should solve the problem. With the message.timestamp.difference.max.ms
> > configuration, we can ensure that 1) the log segment will be rolled
> > within a bounded time, and 2) no excessively large timestamp will be
> > accepted and cause frequent log rolling.
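As a rough sketch of the rolling rule just described (class and method names here are illustrative, not Kafka's actual internals):

```java
// Sketch of the proposed wall-clock-independent rolling check.
// All names are illustrative, not Kafka's actual internals.
final class RollCheck {
    /**
     * Roll the segment when the current message's timestamp exceeds the
     * first message's timestamp in the segment by more than log.roll.ms.
     */
    static boolean shouldRoll(long firstMessageTs, long currentMessageTs, long logRollMs) {
        return currentMessageTs - firstMessageTs > logRollMs;
    }
}
```

message.timestamp.difference.max.ms then bounds how far a message timestamp may drift from the broker's wall clock, which is what keeps the roll time bounded and rejects excessively large timestamps.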
> >
> > Any concern regarding this change?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Jun 13, 2016 at 2:30 PM, Guozhang Wang <wa...@gmail.com>
> > wrote:
> >
> > > Thanks Jiangjie,
> > >
> > > I see the need for sensitive data purging; the above proposed change
> > > LGTM. One minor concern is that a wrongly marked timestamp on the first
> > > record could cause the segment to roll much later / earlier, though
> > > that may be rare.
> > >
> > > Guozhang
> > >
> > > On Fri, Jun 10, 2016 at 10:07 AM, Becket Qin <be...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > During the implementation of KIP-33, we found it might be useful to
> > > > have a more deterministic time based log rolling than what was
> > > > proposed in the KIP.
> > > >
> > > > The current KIP proposal uses the largest timestamp in the segment for
> > > > time based rolling. The active log segment only rolls when no message
> > > > has been appended in max.roll.ms since the largest timestamp in the
> > > > segment, i.e. the rolling time may change if users keep appending
> > > > messages to the segment. This may not be a desirable behavior for
> > > > people who have sensitive data and want to make sure it is removed
> > > > after some time.
> > > >
> > > > To solve the above issue, we want to modify the KIP proposal regarding
> > > > the time based rolling to the following behavior. The time based log
> > > > rolling will be based on the first message with a timestamp in the log
> > > > segment, if there is such a message. If no message in the segment has
> > > > a timestamp, the time based log rolling will still be based on the log
> > > > segment's create time, which is the same as what we are doing now. The
> > > > reasons we don't want to always roll based on file create time are:
> > > > 1) the message timestamp may be assigned by clients, which can differ
> > > > from the create time of the log segment file; 2) on some Linux
> > > > filesystems the file create time is not available, so using the
> > > > segment file create time may not always work.
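A minimal sketch of the fallback just described (illustrative names, not Kafka's actual internals; a null timestamp models "no message in the segment has a timestamp"):

```java
// Sketch of the proposed rolling base time: use the first message's
// timestamp when present, otherwise fall back to the segment create time.
// Names are illustrative, not Kafka's actual internals.
final class SegmentRollBase {
    static long rollingBaseTime(Long firstMessageTs, long segmentCreateTimeMs) {
        // firstMessageTs == null models a segment with no timestamped messages.
        return firstMessageTs != null ? firstMessageTs : segmentCreateTimeMs;
    }
}
```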
> > > >
> > > > Do people have any concern for this change? I will update the KIP if
> > > people
> > > > think the change is OK.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Tue, Apr 19, 2016 at 6:27 PM, Becket Qin <be...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks Joel and Ismael. I just updated the KIP based on your
> > feedback.
> > > > >
> > > > > KIP-33 has passed with +4 (binding) and +2 (non-binding).
> > > > >
> > > > > Thanks everyone for the reading, feedback and voting!
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > > On Tue, Apr 19, 2016 at 5:25 PM, Ismael Juma <is...@juma.me.uk>
> > > > > wrote:
> > > > >
> > > > >> Thanks Becket. I think it would be nice to update the KIP with
> > > > >> regard to points 3 and 4.
> > > > >>
> > > > >> In any case, +1 (non-binding)
> > > > >>
> > > > >> Ismael
> > > > >>
> > > > >> On Tue, Apr 19, 2016 at 2:03 AM, Becket Qin <becket.qin@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Thanks for the comments Ismael. Please see the replies inline.
> > > > >> >
> > > > >> > On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma <ismael@juma.me.uk>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hi Jiangjie,
> > > > >> > >
> > > > >> > > Thanks for the KIP, it's a nice improvement. Since it seems
> > > > >> > > like we have been using the voting thread for discussion, I'll
> > > > >> > > do the same.
> > > > >> > >
> > > > >> > > A few minor comments/questions:
> > > > >> > >
> > > > >> > > 1. The proposed name for the time index file is
> > > > >> > > `SegmentBaseOffset.timeindex`. Would
> > > > >> > > `SegmentBaseOffset.time-index` be a little better? It would
> > > > >> > > clearly separate the type of index in case we add additional
> > > > >> > > index types in the future.
> > > > >> > >
> > > > >> > I have no strong opinion on this; I am not adding any separator
> > > > >> > because the current name is more regex friendly. I am not sure
> > > > >> > about other index types; time and space seem to be the two most
> > > > >> > common dimensions.
> > > > >> >
> > > > >> > > 2. When describing the time index entry, we say "Offset - the
> > > > >> > > next offset when the time index entry is inserted". I found the
> > > > >> > > mention of `next` a bit confusing as it looks to me like the
> > > > >> > > time index entry has the first offset in the message set.
> > > > >> >
> > > > >> > The semantics here are a little different from the offset index.
> > > > >> > The offset index information is self-contained by nature, i.e.
> > > > >> > all the offsets before are smaller than the offset of this
> > > > >> > message set. So we only need to say "the offset of this message
> > > > >> > set is OFFSET". This does not quite apply to the time index,
> > > > >> > because the max timestamp may or may not be in the message set
> > > > >> > being appended. So we have to either say "the max timestamp
> > > > >> > before I append this message set is T", or "the max timestamp
> > > > >> > after I appended this message set is T". The former case means
> > > > >> > that we can skip all the previous messages if we are looking for
> > > > >> > a timestamp > T and start from this offset. The latter one means
> > > > >> > that if we are searching for timestamp > T, we should start after
> > > > >> > this message set, which is essentially the same as the former
> > > > >> > case but requires an additional interpretation.
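To illustrate the former interpretation, here is a hypothetical sketch (entry layout and names are not Kafka's actual code): each entry records that every message before the stored offset has a timestamp at most T, so a search for timestamps greater than a target can skip past all entries whose T is less than or equal to the target.

```java
// Sketch of the "max timestamp before this offset is T" interpretation.
// Each entry is {maxTimestampSoFar, offset}; names are illustrative.
final class TimeIndexLookup {
    /** Offset to start scanning from when looking for timestamps > target. */
    static long startOffsetFor(long[][] entries, long target) {
        long start = 0L;
        for (long[] e : entries) {
            long maxTsBefore = e[0], offset = e[1];
            if (maxTsBefore <= target) {
                start = offset;  // everything before this offset is <= target
            } else {
                break;           // entries are monotonic within one index file
            }
        }
        return start;
    }
}
```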
> > > > >> >
> > > > >> > > 3. We say "The default initial / max size of the time index
> > > > >> > > files is the same as the offset index files. (a time index
> > > > >> > > entry is 1.5x the size of an offset index entry, so users
> > > > >> > > should set the configuration accordingly)". It may be worth
> > > > >> > > elaborating a little on what a user should do with regards to
> > > > >> > > this configuration when upgrading (i.e. maybe under
> > > > >> > > "Compatibility, Deprecation, and Migration Plan").
> > > > >> > >
> > > > >> > Makes sense.
> > > > >> >
> > > > >> >
> > > > >> > > 4. In a previous vote thread, Jun said "The simplest thing is
> > > > >> > > probably to change the default index size to 2MB to match the
> > > > >> > > default log segment size" and you seemed to agree. I couldn't
> > > > >> > > find anything about this in the KIP. Are we still doing it?
> > > > >> > >
> > > > >> > Yes, we can still make the change to the default settings. Users
> > > > >> > might want to set the index size a little larger if they have a
> > > > >> > customized segment size, but in reality it should not cause
> > > > >> > problems other than rolling a few more log segments.
> > > > >> >
> > > > >> > > 5. We say "Instead, it is only monotonically increasing within
> > > > >> > > each time index file. i.e. It is possible that the time index
> > > > >> > > file for a later log segment contains smaller timestamp than
> > > > >> > > some timestamp in the time index file of an earlier segment.".
> > > > >> > > I think it would be good to explain under which scenario a
> > > > >> > > time index file for a later log segment contains a smaller
> > > > >> > > timestamp (is this only when CreateTime is used?).
> > > > >> > >
> > > > >> > Yes, it only happens when CreateTime is used.
> > > > >> >
> > > > >> >
> > > > >> > > 6. We say "When searching by timestamp, broker will start from
> > > > >> > > the earliest log segment and check the last time index entry.".
> > > > >> > > The existing logic searches from the newest segment backwards.
> > > > >> > > Is there a reason why we are changing it?
> > > > >> > >
> > > > >> > Suppose segment 0 has max timestamp 100, segment 1 has max
> > > > >> > timestamp 50 and segment 2 has max timestamp 90, and now a user
> > > > >> > wants to search for timestamp 80. If we search backwards, we have
> > > > >> > to look at all the segments. If we search forward, we will stop
> > > > >> > at the first segment whose max timestamp is greater than 80 (i.e.
> > > > >> > all the previous segments have smaller timestamps) and start the
> > > > >> > finer search on that segment.
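The forward search described above can be sketched as follows (illustrative names, not the actual broker code):

```java
// Sketch of the forward search: scan segments from the earliest and stop at
// the first whose max timestamp exceeds the target, since all earlier
// segments then contain only smaller timestamps. Names are illustrative.
final class TimestampSearch {
    /** Index of the first segment whose max timestamp is > target, or -1 if none. */
    static int findSegment(long[] segmentMaxTimestamps, long target) {
        for (int i = 0; i < segmentMaxTimestamps.length; i++) {
            if (segmentMaxTimestamps[i] > target) {
                return i;  // the finer-grained search continues in this segment
            }
        }
        return -1;  // target is larger than every timestamp in the log
    }
}
```

For the example above (max timestamps 100, 50, 90 and a target of 80), this stops at segment 0 without touching the later segments.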
> > > > >> >
> > > > >> >
> > > > >> > > 7. Do you mind if I fix typos and minor grammar issues
> > > > >> > > directly in the wiki? It seems easier than doing that via
> > > > >> > > email.
> > > > >> > >
> > > > >> > Not at all, thanks for the help.
> > > > >> >
> > > > >> >
> > > > >> > > Thanks,
> > > > >> > > Ismael
> > > > >> > >
> > > > >> > > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <becket.qin@gmail.com>
> > > > >> > > wrote:
> > > > >> > >
> > > > >> > > > Hi all,
> > > > >> > > >
> > > > >> > > > I updated KIP-33 based on the initial implementation. Per
> > > > >> > > > discussion on yesterday's KIP hangout, I would like to
> > > > >> > > > initiate the new vote thread for KIP-33.
> > > > >> > > >
> > > > >> > > > The KIP wiki:
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> > > > >> > > >
> > > > >> > > > Here is a brief summary of the KIP:
> > > > >> > > > 1. We propose to add a time index for each log segment.
> > > > >> > > > 2. The time indices are going to be used for log retention,
> > > > >> > > > log rolling and message search by timestamp.
> > > > >> > > >
> > > > >> > > > There was an old voting thread which has some discussion on
> > > > >> > > > this KIP. The mail thread link is:
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >> > > > http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> > > > >> > > >
> > > > >> > > > I have the following WIP patch for reference. It needs a few
> > > > >> > > > more unit tests and documentation. Other than that it should
> > > > >> > > > run fine.
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >> > > > https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> > > > >> > > >
> > > > >> > > > Thanks,
> > > > >> > > >
> > > > >> > > > Jiangjie (Becket) Qin
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
>
> --
> -- Guozhang
>