Posted to dev@kafka.apache.org by Becket Qin <be...@gmail.com> on 2016/04/07 02:44:24 UTC

[VOTE] KIP-33 - Add a time based log index

Hi all,

I updated KIP-33 based on the initial implementation. Per discussion on
yesterday's KIP hangout, I would like to initiate the new vote thread for
KIP-33.

The KIP wiki:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index

Here is a brief summary of the KIP:
1. We propose to add a time index for each log segment.
2. The time indices are going to be used for log retention, log rolling and
message search by timestamp.
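
For readers new to the thread, here is a rough illustration (not code from
the WIP patch) of what a single time index entry holds; the 12-byte layout
(8-byte timestamp plus 4-byte relative offset) is an assumption consistent
with the per-segment check cost mentioned later in this thread:

/**
 * Illustrative sketch only, not code from the WIP patch: one time index
 * entry as discussed in this KIP -- a (timestamp, offset) pair per segment.
 * The 8-byte timestamp + 4-byte relative offset layout (12 bytes total) is
 * an assumption based on the "12 bytes read" per segment mentioned below.
 */
public final class TimeIndexEntry {
    public final long timestamp;      // max timestamp of messages before the paired offset in this segment
    public final int relativeOffset;  // offset relative to the segment base offset

    public TimeIndexEntry(long timestamp, int relativeOffset) {
        this.timestamp = timestamp;
        this.relativeOffset = relativeOffset;
    }
}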

There was an old voting thread which has some discussion of this KIP. The
mail thread link is as follows:
http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E

I have the following WIP patch for reference. It needs a few more unit
tests and documentation. Other than that it should run fine.
https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7

Thanks,

Jiangjie (Becket) Qin

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Joel Koshy <jj...@gmail.com>.
Hi Becket,

+1

Just a few more minor comments that I don't think should result in any
material change to the KIP:

   - Under "time index format":
*A time index entry (timestamp, offset) means that any message whose
   timestamp is greater than timestamp come after offset. *
      - I think it may be clear enough from the context, but I think this
      is only true within a segment so consider making that explicit. Also, a
      minor nit-pick that would help reduce the burden on the reader: the text
      uses the same label for all three references - i.e., (*timestamp*,
      offset) and then two mentions of *timestamp* in the same sentence.
      Would be clearer if it is something like: *A time index entry (T,
      offset) means that any message whose timestamp is greater than T ...*
   - Under "build the time index"
      - *User should set the configuration accordingly* is vague. You mean
      in order to satisfy some user-determined size constraint?
      - *The time index is not globally monotonically increasing*: globally
      is a confusing word to use here. I think you basically mean to
say does not
      necessarily monotonically increase within a partition. Instead,
it is only
      monotonically increasing within each time index...
      - *The density of the index is defined by index.interval.bytes*: I
      don't quite see how this is directly defined by that configuration. i.e.,
      it has a direct effect on the regular offset index density but only a
      secondary effect on the time index density right?
      - Under "Enforce time-based log retention":
      - I vaguely remember some mention of avoiding gaps - i.e., we enforce
      retention left to right. i.e., start removing from the first (by
creation)
      log segment and stop when we reach an ineligible segment. Unless
something
      changed may be worth explicitly calling that out.
   - There are a few more minor clarifying edits that I can help with
   offline.
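
To make the within-segment guarantee above concrete, here is a rough sketch
(hypothetical names, not code from the patch) of how a per-segment lookup
could use it: find the last entry whose timestamp is below the target and
start scanning from its offset.

// Hedged sketch, not the KIP implementation: find where to start scanning
// within ONE segment for a target timestamp, relying on the guarantee that
// an entry (T, offset) means every message with timestamp greater than T
// sits at or after 'offset' in this segment.
final class SegmentTimeIndexLookup {
    /**
     * @param timestamps entry timestamps, non-decreasing within the segment
     * @param offsets    offsets paired with the timestamps above
     * @param baseOffset the segment base offset, used when no entry applies
     * @param target     the timestamp being searched for
     */
    static long lookupStartOffset(long[] timestamps, long[] offsets,
                                  long baseOffset, long target) {
        int lo = 0, hi = timestamps.length - 1, best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (timestamps[mid] < target) {
                best = mid;        // safe to skip everything before offsets[mid]
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return best < 0 ? baseOffset : offsets[best];
    }
}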

Thanks,

Joel

On Mon, Apr 18, 2016 at 7:35 PM, Becket Qin <be...@gmail.com> wrote:

> Hi Grant,
>
> Thanks for the review. Please see the replies inline.
>
> On Mon, Apr 18, 2016 at 1:56 PM, Grant Henke <gh...@cloudera.com> wrote:
>
> > Thanks for the detailed write up Jiangjie. Overall the proposal looks
> good
> > to me. I have a few implementation detail questions that
> > don't necessarily need to block progress:
> >
> > When searching by timestamp, broker will start from the earliest log
> > > segment and check the last time index entry. If the timestamp of the
> last
> > > time index entry is greater than the target timestamp, the broker will
> do
> > > binary search on that time index to find the closest index entry and
> scan
> > > the log from there. Otherwise it will move on to the next log segment.
> > >
> >
> > Does this mean having more old data will increase the length of timestamp
> > searches? Have you considered searching from the most recent segment? I
> am
> > thinking looking up more recent data is likely the common case but
> haven't
> > thought about it too much.
> >
> > Segment level search is very quick. i.e. read the last time index entry
> of
> a segment and see if it is greater than the target timestamp.
> Having more old data does mean we need to search more segments, but since
> checking on a segment is cheap (12 bytes read) it is probably fine.
>
>
> > To enforce time based log retention, the broker will check the last time
> > > index entry of a log segment. The timestamp will be the latest
> timestamp
> > of
> > > the messages in the log segment. So if that timestamp expires, the
> broker
> > > will delete the log segment.
> > >
> >
> > Does this mean the choice between message.timestamp.type affects
> retention
> > time?
>
> The time index is timestamp type agnostic. No matter the timestamp type is
> CreateTime or LogAppendTime, the same algorithm would work.
>
> >
> >
> When broker receives a message, if the message is not rejected due to
> > > timestamp exceeds threshold, the message will be appended to the log.
> > (The
> > > timestamp will either be LogAppendTime or CreateTime depending on the
> > > configuration)
> > >
>
>
> > What happens when message.timestamp.type is changed?
> >
> Because the time index is timestamp type agnostic. The message timestamp
> change is transparent to the time index. So it seems there is no special
> cases to be taken care of.
>
> >
> > Are there any special considerations for compacted topics?
>
> I did not see anything special there. The only thing special might be
> appending the last time index entry to the cleaned segment after cleaning.
>
> >
> > Thank you,
> > Grant
> >
> >
> > On Mon, Apr 18, 2016 at 8:50 AM, Ismael Juma <is...@juma.me.uk> wrote:
> >
> > > Hi Jiangjie,
> > >
> > > Thanks for the KIP, it's a nice improvement. Since it seems like we
> have
> > > been using the voting thread for discussion, I'll do the same.
> > >
> > > A few minor comments/questions:
> > >
> > > 1. The proposed name for the time index file
> > `SegmentBaseOffset.timeindex`.
> > > Would `SegmentBaseOffset.time-index` be a little better? It would
> clearly
> > > separate the type of index in case we add additional index types in the
> > > future.
> > > 2. When describing the time index entry, we say "Offset - the next
> offset
> > > when the time index entry is inserted". I found the mention of `next` a
> > bit
> > > confusing as it looks to me like the time index entry has the first
> > offset
> > > in the message set.
> > > 3. We say "The default initial / max size of the time index files is
> the
> > > same as the offset index files. (time index entry is 1.5x of the size
> of
> > > offset index entry, user should set the configuration accordingly)". It
> > may
> > > be worth elaborating a little on what a user should do with regards to
> > this
> > > configuration when upgrading (ie maybe under "Compatibility,
> Deprecation,
> > > and Migration Plan").
> > > 4. In a previous vote thread, Jun said "The simplest thing is probably
> > > to change
> > > the default index size to 2MB to match the default log segment size"
> and
> > > you seemed to agree. I couldn't find anything about this in the KIP.
> Are
> > we
> > > still doing it?
> > > 5. We say "Instead, it is only monotonically increasing within each
> time
> > > index file. i.e. It is possible that the time index file for a later
> log
> > > segment contains smaller timestamp than some timestamp in the time
> index
> > > file of an earlier segment.". I think it would be good to explain under
> > > which scenario a time index file for a later log segment contains a
> > smaller
> > > timestamp (is this only when CreateTime is used?).
> > > 6. We say "When searching by timestamp, broker will start from the
> > earliest
> > > log segment and check the last time index entry.". The existing logic
> > > searches from newest segment backwards. Is there a reason why we are
> > > changing it?
> > > 7. Do you mind if I fix typos and minor grammar issues directly in the
> > > wiki? It seems easier than doing that via email.
> > >
> > > Thanks,
> > > Ismael
> > >
> > > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <be...@gmail.com>
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I updated KIP-33 based on the initial implementation. Per discussion
> on
> > > > yesterday's KIP hangout, I would like to initiate the new vote thread
> > for
> > > > KIP-33.
> > > >
> > > > The KIP wiki:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> > > >
> > > > Here is a brief summary of the KIP:
> > > > 1. We propose to add a time index for each log segment.
> > > > 2. The time indices are going to be used of log retention, log
> rolling
> > > and
> > > > message search by timestamp.
> > > >
> > > > There was an old voting thread which has some discussions on this
> KIP.
> > > The
> > > > mail thread link is following:
> > > >
> > > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> > > >
> > > > I have the following WIP patch for reference. It needs a few more
> unit
> > > > tests and documentation. Other than that it should run fine.
> > > >
> > > >
> > >
> >
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > >
> >
> >
> >
> > --
> > Grant Henke
> > Software Engineer | Cloudera
> > grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> >
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Becket Qin <be...@gmail.com>.
Hi Grant,

Thanks for the review. Please see the replies inline.

On Mon, Apr 18, 2016 at 1:56 PM, Grant Henke <gh...@cloudera.com> wrote:

> Thanks for the detailed write up Jiangjie. Overall the proposal looks good
> to me. I have a few implementation detail questions that
> don't necessarily need to block progress:
>
> When searching by timestamp, broker will start from the earliest log
> > segment and check the last time index entry. If the timestamp of the last
> > time index entry is greater than the target timestamp, the broker will do
> > binary search on that time index to find the closest index entry and scan
> > the log from there. Otherwise it will move on to the next log segment.
> >
>
> Does this mean having more old data will increase the length of timestamp
> searches? Have you considered searching from the most recent segment? I am
> thinking looking up more recent data is likely the common case but haven't
> thought about it too much.
>
Segment-level search is very quick, i.e. read the last time index entry of
a segment and see if it is greater than the target timestamp.
Having more old data does mean we need to search more segments, but since
checking a segment is cheap (a 12-byte read) it is probably fine.
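
As a rough illustration of that forward scan (hypothetical names, not the
actual patch), only the last entry of each segment's time index needs to be
read until a segment's max timestamp reaches the target:

import java.util.List;

// Hedged sketch of the segment-level search described above; SegmentView is
// a hypothetical stand-in for a log segment, not a Kafka class.
final class TimestampSearchSketch {
    interface SegmentView {
        long maxIndexedTimestamp();               // timestamp of the last time index entry
        long findOffsetForTimestamp(long target); // binary search + scan within the segment
    }

    /** Returns the offset to start fetching from, or -1 if no segment has newer data. */
    static long searchByTimestamp(List<SegmentView> segmentsOldestFirst, long target) {
        for (SegmentView segment : segmentsOldestFirst) {
            // Checking a segment only needs its last time index entry (a 12-byte read).
            if (segment.maxIndexedTimestamp() >= target) {
                return segment.findOffsetForTimestamp(target);
            }
            // Otherwise every indexed message here is older than the target; move on.
        }
        return -1L;
    }
}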


> To enforce time based log retention, the broker will check the last time
> > index entry of a log segment. The timestamp will be the latest timestamp
> of
> > the messages in the log segment. So if that timestamp expires, the broker
> > will delete the log segment.
> >
>
> Does this mean the choice between message.timestamp.type affects retention
> time?

The time index is timestamp-type agnostic. Whether the timestamp type is
CreateTime or LogAppendTime, the same algorithm works.
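
For what it's worth, a minimal sketch of the retention check as described
above (names are hypothetical), using the last time index entry as the
segment's latest timestamp and walking segments oldest-first:

import java.util.ArrayList;
import java.util.List;

// Hedged sketch of time-based retention using the time index; walk segments
// from oldest to newest and stop at the first segment that is not yet
// eligible, so no gaps are created in the log.
final class TimeBasedRetentionSketch {
    interface SegmentView {
        long latestTimestamp();   // timestamp of the last time index entry
    }

    static List<SegmentView> segmentsToDelete(List<SegmentView> segmentsOldestFirst,
                                              long nowMs, long retentionMs) {
        List<SegmentView> deletable = new ArrayList<>();
        for (SegmentView segment : segmentsOldestFirst) {
            if (nowMs - segment.latestTimestamp() > retentionMs) {
                deletable.add(segment);  // every message in this segment has expired
            } else {
                break;                   // keep this segment and everything after it
            }
        }
        return deletable;
    }
}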

>
>
> > When broker receives a message, if the message is not rejected due to
> > timestamp exceeds threshold, the message will be appended to the log.
> (The
> > timestamp will either be LogAppendTime or CreateTime depending on the
> > configuration)
> >


> What happens when message.timestamp.type is changed?
>
Because the time index is timestamp-type agnostic, a change of the message
timestamp type is transparent to the time index. So it seems there are no
special cases to take care of.

>
> Are there any special considerations for compacted topics?

I did not see anything special there. The only thing special might be
appending the last time index entry to the cleaned segment after cleaning.

>
> Thank you,
> Grant
>
>
> On Mon, Apr 18, 2016 at 8:50 AM, Ismael Juma <is...@juma.me.uk> wrote:
>
> > Hi Jiangjie,
> >
> > Thanks for the KIP, it's a nice improvement. Since it seems like we have
> > been using the voting thread for discussion, I'll do the same.
> >
> > A few minor comments/questions:
> >
> > 1. The proposed name for the time index file
> `SegmentBaseOffset.timeindex`.
> > Would `SegmentBaseOffset.time-index` be a little better? It would clearly
> > separate the type of index in case we add additional index types in the
> > future.
> > 2. When describing the time index entry, we say "Offset - the next offset
> > when the time index entry is inserted". I found the mention of `next` a
> bit
> > confusing as it looks to me like the time index entry has the first
> offset
> > in the message set.
> > 3. We say "The default initial / max size of the time index files is the
> > same as the offset index files. (time index entry is 1.5x of the size of
> > offset index entry, user should set the configuration accordingly)". It
> may
> > be worth elaborating a little on what a user should do with regards to
> this
> > configuration when upgrading (ie maybe under "Compatibility, Deprecation,
> > and Migration Plan").
> > 4. In a previous vote thread, Jun said "The simplest thing is probably
> > to change
> > the default index size to 2MB to match the default log segment size" and
> > you seemed to agree. I couldn't find anything about this in the KIP. Are
> we
> > still doing it?
> > 5. We say "Instead, it is only monotonically increasing within each time
> > index file. i.e. It is possible that the time index file for a later log
> > segment contains smaller timestamp than some timestamp in the time index
> > file of an earlier segment.". I think it would be good to explain under
> > which scenario a time index file for a later log segment contains a
> smaller
> > timestamp (is this only when CreateTime is used?).
> > 6. We say "When searching by timestamp, broker will start from the
> earliest
> > log segment and check the last time index entry.". The existing logic
> > searches from newest segment backwards. Is there a reason why we are
> > changing it?
> > 7. Do you mind if I fix typos and minor grammar issues directly in the
> > wiki? It seems easier than doing that via email.
> >
> > Thanks,
> > Ismael
> >
> > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <be...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I updated KIP-33 based on the initial implementation. Per discussion on
> > > yesterday's KIP hangout, I would like to initiate the new vote thread
> for
> > > KIP-33.
> > >
> > > The KIP wiki:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> > >
> > > Here is a brief summary of the KIP:
> > > 1. We propose to add a time index for each log segment.
> > > 2. The time indices are going to be used of log retention, log rolling
> > and
> > > message search by timestamp.
> > >
> > > There was an old voting thread which has some discussions on this KIP.
> > The
> > > mail thread link is following:
> > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> > >
> > > I have the following WIP patch for reference. It needs a few more unit
> > > tests and documentation. Other than that it should run fine.
> > >
> > >
> >
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> >
>
>
>
> --
> Grant Henke
> Software Engineer | Cloudera
> grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Grant Henke <gh...@cloudera.com>.
Thanks for the detailed write up Jiangjie. Overall the proposal looks good
to me. I have a few implementation detail questions that
don't necessarily need to block progress:

> When searching by timestamp, broker will start from the earliest log
> segment and check the last time index entry. If the timestamp of the last
> time index entry is greater than the target timestamp, the broker will do
> binary search on that time index to find the closest index entry and scan
> the log from there. Otherwise it will move on to the next log segment.
>

Does this mean having more old data will increase the length of timestamp
searches? Have you considered searching from the most recent segment? I am
thinking looking up more recent data is likely the common case but haven't
thought about it too much.

> To enforce time based log retention, the broker will check the last time
> index entry of a log segment. The timestamp will be the latest timestamp of
> the messages in the log segment. So if that timestamp expires, the broker
> will delete the log segment.
>

Does this mean the choice between message.timestamp.type affects retention
time?

> When broker receives a message, if the message is not rejected due to
> timestamp exceeds threshold, the message will be appended to the log. (The
> timestamp will either be LogAppendTime or CreateTime depending on the
> configuration)
>

What happens when message.timestamp.type is changed?

Are there any special considerations for compacted topics?

Thank you,
Grant


On Mon, Apr 18, 2016 at 8:50 AM, Ismael Juma <is...@juma.me.uk> wrote:

> Hi Jiangjie,
>
> Thanks for the KIP, it's a nice improvement. Since it seems like we have
> been using the voting thread for discussion, I'll do the same.
>
> A few minor comments/questions:
>
> 1. The proposed name for the time index file `SegmentBaseOffset.timeindex`.
> Would `SegmentBaseOffset.time-index` be a little better? It would clearly
> separate the type of index in case we add additional index types in the
> future.
> 2. When describing the time index entry, we say "Offset - the next offset
> when the time index entry is inserted". I found the mention of `next` a bit
> confusing as it looks to me like the time index entry has the first offset
> in the message set.
> 3. We say "The default initial / max size of the time index files is the
> same as the offset index files. (time index entry is 1.5x of the size of
> offset index entry, user should set the configuration accordingly)". It may
> be worth elaborating a little on what a user should do with regards to this
> configuration when upgrading (ie maybe under "Compatibility, Deprecation,
> and Migration Plan").
> 4. In a previous vote thread, Jun said "The simplest thing is probably
> to change
> the default index size to 2MB to match the default log segment size" and
> you seemed to agree. I couldn't find anything about this in the KIP. Are we
> still doing it?
> 5. We say "Instead, it is only monotonically increasing within each time
> index file. i.e. It is possible that the time index file for a later log
> segment contains smaller timestamp than some timestamp in the time index
> file of an earlier segment.". I think it would be good to explain under
> which scenario a time index file for a later log segment contains a smaller
> timestamp (is this only when CreateTime is used?).
> 6. We say "When searching by timestamp, broker will start from the earliest
> log segment and check the last time index entry.". The existing logic
> searches from newest segment backwards. Is there a reason why we are
> changing it?
> 7. Do you mind if I fix typos and minor grammar issues directly in the
> wiki? It seems easier than doing that via email.
>
> Thanks,
> Ismael
>
> On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <be...@gmail.com> wrote:
>
> > Hi all,
> >
> > I updated KIP-33 based on the initial implementation. Per discussion on
> > yesterday's KIP hangout, I would like to initiate the new vote thread for
> > KIP-33.
> >
> > The KIP wiki:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> >
> > Here is a brief summary of the KIP:
> > 1. We propose to add a time index for each log segment.
> > 2. The time indices are going to be used of log retention, log rolling
> and
> > message search by timestamp.
> >
> > There was an old voting thread which has some discussions on this KIP.
> The
> > mail thread link is following:
> >
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> >
> > I have the following WIP patch for reference. It needs a few more unit
> > tests and documentation. Other than that it should run fine.
> >
> >
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
>



-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Becket Qin <be...@gmail.com>.
Thanks for the feedback, Guozhang. I'll update the KIP wiki and submit a
patch if no one has concerns about this.

On Wed, Aug 31, 2016 at 11:59 AM, Guozhang Wang <wa...@gmail.com> wrote:

> Some of the streams integration tests also encounters this issue where the
> timestamps we are using in the test are very small (e.g. 1,2,3...) which
> cause the log to roll upon each append, and the old segment gets deleted
> very soon. Arguably this can be resolved to enforce LogAppendTime
> configuration on the embedded server.
>
> +1 on the proposed change, makes sense to me.
>
> Guozhang
>
>
> On Tue, Aug 30, 2016 at 4:33 PM, Becket Qin <be...@gmail.com> wrote:
>
> > Hi folks,
> >
> > Here is another update on the change of time based log rolling.
> >
> > After the latest implementation, we encountered KAFKA-4099. The issue is
> > that if users move replicas, for the messages in the old segments, the
> new
> > replica will create one log segment for each message. The root cause of
> > this is we are comparing the wall clock time with the message timestamp.
> A
> > solution to that is also described in KAFKA-4099, which is to change the
> > log rolling purely based on the timestamp in the messages. More
> > specifically, we roll out the log segment if the timestamp in the current
> > message is greater than the timestamp of the first message in the segment
> > by more than log.roll.ms. This approach is wall clock independent and
> > should solve the problem. With message.timestamp.difference.max.ms
> > configuration, we can achieve 1) the log segment will be rolled out in a
> > bounded time, 2) no excessively large timestamp will be accepted and
> cause
> > frequent log rolling.
> >
> > Any concern regarding this change?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Jun 13, 2016 at 2:30 PM, Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > Thanks Jiangjie,
> > >
> > > I see the need for sensitive data purging, the above proposed change
> > LGTM.
> > > One minor concern is that a wrongly marked timestamp on the first
> record
> > > could cause the segment to roll much later / earlier, though it may be
> > > rare.
> > >
> > > Guozhang
> > >
> > > On Fri, Jun 10, 2016 at 10:07 AM, Becket Qin <be...@gmail.com>
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > During the implementation of KIP-33, we found it might be useful to
> > have
> > > a
> > > > more deterministic time based log rolling than what proposed in the
> > KIP.
> > > >
> > > > The current KIP proposal uses the largest timestamp in the segment
> for
> > > time
> > > > based rolling. The active log segment only rolls when there is no
> > message
> > > > appended in max.roll.ms since the largest timestamp in the segment.
> > i.e.
> > > > the rolling time may change if user keeping appending messages into
> the
> > > > segment. This may not be a desirable behavior for people who have
> > > sensitive
> > > > data and want to make sure they are removed after some time.
> > > >
> > > > To solve the above issue, we want to modify the KIP proposal
> regarding
> > > the
> > > > time based rolling to the following behavior. The time based log
> > rolling
> > > > will be based on the first message with a timestamp in the log
> segment
> > if
> > > > there is such a message. If no message in the segment has a
> timestamp,
> > > the
> > > > time based log rolling will still be based on log segment create
> time,
> > > > which is the same as we are doing now. The reasons we don't want to
> > > always
> > > > roll based on file create time are because 1) the message timestamp
> may
> > > be
> > > > assigned by clients which can be different from the create time of
> the
> > > log
> > > > segment file. 2) On some Linux, the file create time is not
> available,
> > so
> > > > using segment file create time may not always work.
> > > >
> > > > Do people have any concern for this change? I will update the KIP if
> > > people
> > > > think the change is OK.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Tue, Apr 19, 2016 at 6:27 PM, Becket Qin <be...@gmail.com>
> > > wrote:
> > > >
> > > > > Thanks Joel and Ismael. I just updated the KIP based on your
> > feedback.
> > > > >
> > > > > KIP-33 has passed with +4 (binding) and +2 (non-binding)
> > > > >
> > > > > Thanks everyone for the reading, feedback and voting!
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > > On Tue, Apr 19, 2016 at 5:25 PM, Ismael Juma <is...@juma.me.uk>
> > > wrote:
> > > > >
> > > > >> Thanks Becket. I think it would be nice to update the KIP with
> > regards
> > > > to
> > > > >> point 3 and 4.
> > > > >>
> > > > >> In any case, +1 (non-binding)
> > > > >>
> > > > >> Ismael
> > > > >>
> > > > >> On Tue, Apr 19, 2016 at 2:03 AM, Becket Qin <becket.qin@gmail.com
> >
> > > > wrote:
> > > > >>
> > > > >> > Thanks for the comments Ismael. Please see the replies inline.
> > > > >> >
> > > > >> > On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma <ismael@juma.me.uk
> >
> > > > wrote:
> > > > >> >
> > > > >> > > Hi Jiangjie,
> > > > >> > >
> > > > >> > > Thanks for the KIP, it's a nice improvement. Since it seems
> like
> > > we
> > > > >> have
> > > > >> > > been using the voting thread for discussion, I'll do the same.
> > > > >> > >
> > > > >> > > A few minor comments/questions:
> > > > >> > >
> > > > >> > > 1. The proposed name for the time index file
> > > > >> > `SegmentBaseOffset.timeindex`.
> > > > >> > > Would `SegmentBaseOffset.time-index` be a little better? It
> > would
> > > > >> clearly
> > > > >> > > separate the type of index in case we add additional index
> types
> > > in
> > > > >> the
> > > > >> > > future.
> > > > >> > >
> > > > >> > I have no strong opinion on this, I am not adding any thing
> > > separator
> > > > >> > because it is more regex friendly.
> > > > >> > I am not sure about the other indexes, time and space seems to
> be
> > > two
> > > > >> most
> > > > >> > common dimensions.
> > > > >> >
> > > > >> > 2. When describing the time index entry, we say "Offset - the
> next
> > > > >> offset
> > > > >> > > when the time index entry is inserted". I found the mention of
> > > > `next`
> > > > >> a
> > > > >> > bit
> > > > >> > > confusing as it looks to me like the time index entry has the
> > > first
> > > > >> > offset
> > > > >> > > in the message set.
> > > > >> >
> > > > >> > This semantic meaning is a little different from the offset
> index.
> > > The
> > > > >> > offset index information is self-contained by nature. i.e. all
> the
> > > > >> offsets
> > > > >> > before is smaller than the offset of this message set. So we
> only
> > > need
> > > > >> to
> > > > >> > say "the offset of this message set is OFFSET". This does not
> > quite
> > > > >> apply
> > > > >> > to the time index because the max timestamp may or may not be in
> > the
> > > > >> > message set being appended. So we have to either say, "the max
> > > > timestamp
> > > > >> > before I append this message set is T", or "the max timestamp
> > after
> > > I
> > > > >> > appended this message set is T". The former case means that we
> can
> > > > skip
> > > > >> all
> > > > >> > the previous messages if we are looking for a timestamp > T and
> > > start
> > > > >> from
> > > > >> > this offset. The latter one means if we are searching for
> > timestamp
> > > >
> > > > >> T, we
> > > > >> > should start after this message set, which is essentially the
> same
> > > as
> > > > >> the
> > > > >> > former case but require an additional interpretation.
> > > > >> >
> > > > >> > 3. We say "The default initial / max size of the time index
> files
> > is
> > > > the
> > > > >> > > same as the offset index files. (time index entry is 1.5x of
> the
> > > > size
> > > > >> of
> > > > >> > > offset index entry, user should set the configuration
> > > accordingly)".
> > > > >> It
> > > > >> > may
> > > > >> > > be worth elaborating a little on what a user should do with
> > > regards
> > > > to
> > > > >> > this
> > > > >> > > configuration when upgrading (ie maybe under "Compatibility,
> > > > >> Deprecation,
> > > > >> > > and Migration Plan").
> > > > >> > >
> > > > >> > Makes sense.
> > > > >> >
> > > > >> >
> > > > >> > > 4. In a previous vote thread, Jun said "The simplest thing is
> > > > probably
> > > > >> > > to change
> > > > >> > > the default index size to 2MB to match the default log segment
> > > size"
> > > > >> and
> > > > >> > > you seemed to agree. I couldn't find anything about this in
> the
> > > KIP.
> > > > >> Are
> > > > >> > we
> > > > >> > > still doing it?
> > > > >> > >
> > > > >> > Yes, we can still make the change for default settings. User
> might
> > > > want
> > > > >> to
> > > > >> > set the index size a little larger if they have a customized
> size
> > > but
> > > > in
> > > > >> > reality it should not cause problems other than rolling out a
> > little
> > > > >> more
> > > > >> > log segments.
> > > > >> >
> > > > >> > 5. We say "Instead, it is only monotonically increasing within
> > each
> > > > time
> > > > >> > > index file. i.e. It is possible that the time index file for a
> > > later
> > > > >> log
> > > > >> > > segment contains smaller timestamp than some timestamp in the
> > time
> > > > >> index
> > > > >> > > file of an earlier segment.". I think it would be good to
> > explain
> > > > >> under
> > > > >> > > which scenario a time index file for a later log segment
> > contains
> > > a
> > > > >> > smaller
> > > > >> > > timestamp (is this only when CreateTime is used?).
> > > > >> > >
> > > > >> > Yes, it only happens when CreateTime is used.
> > > > >> >
> > > > >> >
> > > > >> > > 6. We say "When searching by timestamp, broker will start from
> > the
> > > > >> > earliest
> > > > >> > > log segment and check the last time index entry.". The
> existing
> > > > logic
> > > > >> > > searches from newest segment backwards. Is there a reason why
> we
> > > are
> > > > >> > > changing it?
> > > > >> > >
> > > > >> > Suppose segment 0 has max timestamp 100, segment 1 has max
> > timestamp
> > > > 50
> > > > >> and
> > > > >> > segment 3 has max timestamp 90, now user want to search for
> > > timestamp
> > > > >> 80.
> > > > >> > If we search backwards, we have to take a look at all the
> > segments.
> > > If
> > > > >> we
> > > > >> > search forward, we will stop at the first segment whose max
> > > timestamp
> > > > is
> > > > >> > greater than 80 (i.e all the previous segments has smaller
> > > timestamps)
> > > > >> and
> > > > >> > start the finer search on that segment.
> > > > >> >
> > > > >> >
> > > > >> > > 7. Do you mind if I fix typos and minor grammar issues
> directly
> > in
> > > > the
> > > > >> > > wiki? It seems easier than doing that via email.
> > > > >> > >
> > > > >> > Not at all, thanks for help.
> > > > >> >
> > > > >> >
> > > > >> > > Thanks,
> > > > >> > > Ismael
> > > > >> > >
> > > > >> > > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <
> > becket.qin@gmail.com>
> > > > >> wrote:
> > > > >> > >
> > > > >> > > > Hi all,
> > > > >> > > >
> > > > >> > > > I updated KIP-33 based on the initial implementation. Per
> > > > >> discussion on
> > > > >> > > > yesterday's KIP hangout, I would like to initiate the new
> vote
> > > > >> thread
> > > > >> > for
> > > > >> > > > KIP-33.
> > > > >> > > >
> > > > >> > > > The KIP wiki:
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 33+-+Add+a+time+based+log+index
> > > > >> > > >
> > > > >> > > > Here is a brief summary of the KIP:
> > > > >> > > > 1. We propose to add a time index for each log segment.
> > > > >> > > > 2. The time indices are going to be used of log retention,
> log
> > > > >> rolling
> > > > >> > > and
> > > > >> > > > message search by timestamp.
> > > > >> > > >
> > > > >> > > > There was an old voting thread which has some discussions on
> > > this
> > > > >> KIP.
> > > > >> > > The
> > > > >> > > > mail thread link is following:
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%
> > > 3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com
> %3E
> > > > >> > > >
> > > > >> > > > I have the following WIP patch for reference. It needs a few
> > > more
> > > > >> unit
> > > > >> > > > tests and documentation. Other than that it should run fine.
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > https://github.com/becketqin/kafka/commit/
> > 712357a3fbf1423e05f9eed7d2fed5
> > > b6fe6c37b7
> > > > >> > > >
> > > > >> > > > Thanks,
> > > > >> > > >
> > > > >> > > > Jiangjie (Becket) Qin
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
>
> --
> -- Guozhang
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Ismael Juma <is...@juma.me.uk>.
Sounds good to me too.

For completeness, one thing that is mentioned in the JIRA but not in the
message below is: "If the first message in the segment does not have a
timestamp, we will fall back to use the wall clock time for log rolling".

Ismael

On Wed, Aug 31, 2016 at 7:59 PM, Guozhang Wang <wa...@gmail.com> wrote:

> Some of the streams integration tests also encounters this issue where the
> timestamps we are using in the test are very small (e.g. 1,2,3...) which
> cause the log to roll upon each append, and the old segment gets deleted
> very soon. Arguably this can be resolved to enforce LogAppendTime
> configuration on the embedded server.
>
> +1 on the proposed change, makes sense to me.
>
> Guozhang
>
>
> On Tue, Aug 30, 2016 at 4:33 PM, Becket Qin <be...@gmail.com> wrote:
>
> > Hi folks,
> >
> > Here is another update on the change of time based log rolling.
> >
> > After the latest implementation, we encountered KAFKA-4099. The issue is
> > that if users move replicas, for the messages in the old segments, the
> new
> > replica will create one log segment for each message. The root cause of
> > this is we are comparing the wall clock time with the message timestamp.
> A
> > solution to that is also described in KAFKA-4099, which is to change the
> > log rolling purely based on the timestamp in the messages. More
> > specifically, we roll out the log segment if the timestamp in the current
> > message is greater than the timestamp of the first message in the segment
> > by more than log.roll.ms. This approach is wall clock independent and
> > should solve the problem. With message.timestamp.difference.max.ms
> > configuration, we can achieve 1) the log segment will be rolled out in a
> > bounded time, 2) no excessively large timestamp will be accepted and
> cause
> > frequent log rolling.
> >
> > Any concern regarding this change?
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Jun 13, 2016 at 2:30 PM, Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > Thanks Jiangjie,
> > >
> > > I see the need for sensitive data purging, the above proposed change
> > LGTM.
> > > One minor concern is that a wrongly marked timestamp on the first
> record
> > > could cause the segment to roll much later / earlier, though it may be
> > > rare.
> > >
> > > Guozhang
> > >
> > > On Fri, Jun 10, 2016 at 10:07 AM, Becket Qin <be...@gmail.com>
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > During the implementation of KIP-33, we found it might be useful to
> > have
> > > a
> > > > more deterministic time based log rolling than what proposed in the
> > KIP.
> > > >
> > > > The current KIP proposal uses the largest timestamp in the segment
> for
> > > time
> > > > based rolling. The active log segment only rolls when there is no
> > message
> > > > appended in max.roll.ms since the largest timestamp in the segment.
> > i.e.
> > > > the rolling time may change if user keeping appending messages into
> the
> > > > segment. This may not be a desirable behavior for people who have
> > > sensitive
> > > > data and want to make sure they are removed after some time.
> > > >
> > > > To solve the above issue, we want to modify the KIP proposal
> regarding
> > > the
> > > > time based rolling to the following behavior. The time based log
> > rolling
> > > > will be based on the first message with a timestamp in the log
> segment
> > if
> > > > there is such a message. If no message in the segment has a
> timestamp,
> > > the
> > > > time based log rolling will still be based on log segment create
> time,
> > > > which is the same as we are doing now. The reasons we don't want to
> > > always
> > > > roll based on file create time are because 1) the message timestamp
> may
> > > be
> > > > assigned by clients which can be different from the create time of
> the
> > > log
> > > > segment file. 2) On some Linux, the file create time is not
> available,
> > so
> > > > using segment file create time may not always work.
> > > >
> > > > Do people have any concern for this change? I will update the KIP if
> > > people
> > > > think the change is OK.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Tue, Apr 19, 2016 at 6:27 PM, Becket Qin <be...@gmail.com>
> > > wrote:
> > > >
> > > > > Thanks Joel and Ismael. I just updated the KIP based on your
> > feedback.
> > > > >
> > > > > KIP-33 has passed with +4 (binding) and +2 (non-binding)
> > > > >
> > > > > Thanks everyone for the reading, feedback and voting!
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > > On Tue, Apr 19, 2016 at 5:25 PM, Ismael Juma <is...@juma.me.uk>
> > > wrote:
> > > > >
> > > > >> Thanks Becket. I think it would be nice to update the KIP with
> > regards
> > > > to
> > > > >> point 3 and 4.
> > > > >>
> > > > >> In any case, +1 (non-binding)
> > > > >>
> > > > >> Ismael
> > > > >>
> > > > >> On Tue, Apr 19, 2016 at 2:03 AM, Becket Qin <becket.qin@gmail.com
> >
> > > > wrote:
> > > > >>
> > > > >> > Thanks for the comments Ismael. Please see the replies inline.
> > > > >> >
> > > > >> > On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma <ismael@juma.me.uk
> >
> > > > wrote:
> > > > >> >
> > > > >> > > Hi Jiangjie,
> > > > >> > >
> > > > >> > > Thanks for the KIP, it's a nice improvement. Since it seems
> like
> > > we
> > > > >> have
> > > > >> > > been using the voting thread for discussion, I'll do the same.
> > > > >> > >
> > > > >> > > A few minor comments/questions:
> > > > >> > >
> > > > >> > > 1. The proposed name for the time index file
> > > > >> > `SegmentBaseOffset.timeindex`.
> > > > >> > > Would `SegmentBaseOffset.time-index` be a little better? It
> > would
> > > > >> clearly
> > > > >> > > separate the type of index in case we add additional index
> types
> > > in
> > > > >> the
> > > > >> > > future.
> > > > >> > >
> > > > >> > I have no strong opinion on this, I am not adding any thing
> > > separator
> > > > >> > because it is more regex friendly.
> > > > >> > I am not sure about the other indexes, time and space seems to
> be
> > > two
> > > > >> most
> > > > >> > common dimensions.
> > > > >> >
> > > > >> > 2. When describing the time index entry, we say "Offset - the
> next
> > > > >> offset
> > > > >> > > when the time index entry is inserted". I found the mention of
> > > > `next`
> > > > >> a
> > > > >> > bit
> > > > >> > > confusing as it looks to me like the time index entry has the
> > > first
> > > > >> > offset
> > > > >> > > in the message set.
> > > > >> >
> > > > >> > This semantic meaning is a little different from the offset
> index.
> > > The
> > > > >> > offset index information is self-contained by nature. i.e. all
> the
> > > > >> offsets
> > > > >> > before is smaller than the offset of this message set. So we
> only
> > > need
> > > > >> to
> > > > >> > say "the offset of this message set is OFFSET". This does not
> > quite
> > > > >> apply
> > > > >> > to the time index because the max timestamp may or may not be in
> > the
> > > > >> > message set being appended. So we have to either say, "the max
> > > > timestamp
> > > > >> > before I append this message set is T", or "the max timestamp
> > after
> > > I
> > > > >> > appended this message set is T". The former case means that we
> can
> > > > skip
> > > > >> all
> > > > >> > the previous messages if we are looking for a timestamp > T and
> > > start
> > > > >> from
> > > > >> > this offset. The latter one means if we are searching for
> > timestamp
> > > >
> > > > >> T, we
> > > > >> > should start after this message set, which is essentially the
> same
> > > as
> > > > >> the
> > > > >> > former case but require an additional interpretation.
> > > > >> >
> > > > >> > 3. We say "The default initial / max size of the time index
> files
> > is
> > > > the
> > > > >> > > same as the offset index files. (time index entry is 1.5x of
> the
> > > > size
> > > > >> of
> > > > >> > > offset index entry, user should set the configuration
> > > accordingly)".
> > > > >> It
> > > > >> > may
> > > > >> > > be worth elaborating a little on what a user should do with
> > > regards
> > > > to
> > > > >> > this
> > > > >> > > configuration when upgrading (ie maybe under "Compatibility,
> > > > >> Deprecation,
> > > > >> > > and Migration Plan").
> > > > >> > >
> > > > >> > Makes sense.
> > > > >> >
> > > > >> >
> > > > >> > > 4. In a previous vote thread, Jun said "The simplest thing is
> > > > probably
> > > > >> > > to change
> > > > >> > > the default index size to 2MB to match the default log segment
> > > size"
> > > > >> and
> > > > >> > > you seemed to agree. I couldn't find anything about this in
> the
> > > KIP.
> > > > >> Are
> > > > >> > we
> > > > >> > > still doing it?
> > > > >> > >
> > > > >> > Yes, we can still make the change for default settings. User
> might
> > > > want
> > > > >> to
> > > > >> > set the index size a little larger if they have a customized
> size
> > > but
> > > > in
> > > > >> > reality it should not cause problems other than rolling out a
> > little
> > > > >> more
> > > > >> > log segments.
> > > > >> >
> > > > >> > 5. We say "Instead, it is only monotonically increasing within
> > each
> > > > time
> > > > >> > > index file. i.e. It is possible that the time index file for a
> > > later
> > > > >> log
> > > > >> > > segment contains smaller timestamp than some timestamp in the
> > time
> > > > >> index
> > > > >> > > file of an earlier segment.". I think it would be good to
> > explain
> > > > >> under
> > > > >> > > which scenario a time index file for a later log segment
> > contains
> > > a
> > > > >> > smaller
> > > > >> > > timestamp (is this only when CreateTime is used?).
> > > > >> > >
> > > > >> > Yes, it only happens when CreateTime is used.
> > > > >> >
> > > > >> >
> > > > >> > > 6. We say "When searching by timestamp, broker will start from
> > the
> > > > >> > earliest
> > > > >> > > log segment and check the last time index entry.". The
> existing
> > > > logic
> > > > >> > > searches from newest segment backwards. Is there a reason why
> we
> > > are
> > > > >> > > changing it?
> > > > >> > >
> > > > >> > Suppose segment 0 has max timestamp 100, segment 1 has max
> > timestamp
> > > > 50
> > > > >> and
> > > > >> > segment 3 has max timestamp 90, now user want to search for
> > > timestamp
> > > > >> 80.
> > > > >> > If we search backwards, we have to take a look at all the
> > segments.
> > > If
> > > > >> we
> > > > >> > search forward, we will stop at the first segment whose max
> > > timestamp
> > > > is
> > > > >> > greater than 80 (i.e all the previous segments has smaller
> > > timestamps)
> > > > >> and
> > > > >> > start the finer search on that segment.
> > > > >> >
> > > > >> >
> > > > >> > > 7. Do you mind if I fix typos and minor grammar issues
> directly
> > in
> > > > the
> > > > >> > > wiki? It seems easier than doing that via email.
> > > > >> > >
> > > > >> > Not at all, thanks for help.
> > > > >> >
> > > > >> >
> > > > >> > > Thanks,
> > > > >> > > Ismael
> > > > >> > >
> > > > >> > > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <
> > becket.qin@gmail.com>
> > > > >> wrote:
> > > > >> > >
> > > > >> > > > Hi all,
> > > > >> > > >
> > > > >> > > > I updated KIP-33 based on the initial implementation. Per
> > > > >> discussion on
> > > > >> > > > yesterday's KIP hangout, I would like to initiate the new
> vote
> > > > >> thread
> > > > >> > for
> > > > >> > > > KIP-33.
> > > > >> > > >
> > > > >> > > > The KIP wiki:
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 33+-+Add+a+time+based+log+index
> > > > >> > > >
> > > > >> > > > Here is a brief summary of the KIP:
> > > > >> > > > 1. We propose to add a time index for each log segment.
> > > > >> > > > 2. The time indices are going to be used of log retention,
> log
> > > > >> rolling
> > > > >> > > and
> > > > >> > > > message search by timestamp.
> > > > >> > > >
> > > > >> > > > There was an old voting thread which has some discussions on
> > > this
> > > > >> KIP.
> > > > >> > > The
> > > > >> > > > mail thread link is following:
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%
> > > 3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com
> %3E
> > > > >> > > >
> > > > >> > > > I have the following WIP patch for reference. It needs a few
> > > more
> > > > >> unit
> > > > >> > > > tests and documentation. Other than that it should run fine.
> > > > >> > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > https://github.com/becketqin/kafka/commit/
> > 712357a3fbf1423e05f9eed7d2fed5
> > > b6fe6c37b7
> > > > >> > > >
> > > > >> > > > Thanks,
> > > > >> > > >
> > > > >> > > > Jiangjie (Becket) Qin
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
>
> --
> -- Guozhang
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Guozhang Wang <wa...@gmail.com>.
Some of the streams integration tests also encounter this issue: the
timestamps we are using in the tests are very small (e.g. 1, 2, 3, ...),
which causes the log to roll upon each append, and the old segment gets
deleted very soon. Arguably this can be resolved by enforcing the
LogAppendTime configuration on the embedded server.

+1 on the proposed change, makes sense to me.

Guozhang


On Tue, Aug 30, 2016 at 4:33 PM, Becket Qin <be...@gmail.com> wrote:

> Hi folks,
>
> Here is another update on the change of time based log rolling.
>
> After the latest implementation, we encountered KAFKA-4099. The issue is
> that if users move replicas, for the messages in the old segments, the new
> replica will create one log segment for each message. The root cause of
> this is we are comparing the wall clock time with the message timestamp. A
> solution to that is also described in KAFKA-4099, which is to change the
> log rolling purely based on the timestamp in the messages. More
> specifically, we roll out the log segment if the timestamp in the current
> message is greater than the timestamp of the first message in the segment
> by more than log.roll.ms. This approach is wall clock independent and
> should solve the problem. With message.timestamp.difference.max.ms
> configuration, we can achieve 1) the log segment will be rolled out in a
> bounded time, 2) no excessively large timestamp will be accepted and cause
> frequent log rolling.
>
> Any concern regarding this change?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, Jun 13, 2016 at 2:30 PM, Guozhang Wang <wa...@gmail.com> wrote:
>
> > Thanks Jiangjie,
> >
> > I see the need for sensitive data purging, the above proposed change
> LGTM.
> > One minor concern is that a wrongly marked timestamp on the first record
> > could cause the segment to roll much later / earlier, though it may be
> > rare.
> >
> > Guozhang
> >
> > On Fri, Jun 10, 2016 at 10:07 AM, Becket Qin <be...@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > During the implementation of KIP-33, we found it might be useful to
> have
> > a
> > > more deterministic time based log rolling than what proposed in the
> KIP.
> > >
> > > The current KIP proposal uses the largest timestamp in the segment for
> > time
> > > based rolling. The active log segment only rolls when there is no
> message
> > > appended in max.roll.ms since the largest timestamp in the segment.
> i.e.
> > > the rolling time may change if user keeping appending messages into the
> > > segment. This may not be a desirable behavior for people who have
> > sensitive
> > > data and want to make sure they are removed after some time.
> > >
> > > To solve the above issue, we want to modify the KIP proposal regarding
> > the
> > > time based rolling to the following behavior. The time based log
> rolling
> > > will be based on the first message with a timestamp in the log segment
> if
> > > there is such a message. If no message in the segment has a timestamp,
> > the
> > > time based log rolling will still be based on log segment create time,
> > > which is the same as we are doing now. The reasons we don't want to
> > always
> > > roll based on file create time are because 1) the message timestamp may
> > be
> > > assigned by clients which can be different from the create time of the
> > log
> > > segment file. 2) On some Linux, the file create time is not available,
> so
> > > using segment file create time may not always work.
> > >
> > > Do people have any concern for this change? I will update the KIP if
> > people
> > > think the change is OK.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Tue, Apr 19, 2016 at 6:27 PM, Becket Qin <be...@gmail.com>
> > wrote:
> > >
> > > > Thanks Joel and Ismael. I just updated the KIP based on your
> feedback.
> > > >
> > > > KIP-33 has passed with +4 (binding) and +2 (non-binding)
> > > >
> > > > Thanks everyone for the reading, feedback and voting!
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Tue, Apr 19, 2016 at 5:25 PM, Ismael Juma <is...@juma.me.uk>
> > wrote:
> > > >
> > > >> Thanks Becket. I think it would be nice to update the KIP with
> regards
> > > to
> > > >> point 3 and 4.
> > > >>
> > > >> In any case, +1 (non-binding)
> > > >>
> > > >> Ismael
> > > >>
> > > >> On Tue, Apr 19, 2016 at 2:03 AM, Becket Qin <be...@gmail.com>
> > > wrote:
> > > >>
> > > >> > Thanks for the comments Ismael. Please see the replies inline.
> > > >> >
> > > >> > On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma <is...@juma.me.uk>
> > > wrote:
> > > >> >
> > > >> > > Hi Jiangjie,
> > > >> > >
> > > >> > > Thanks for the KIP, it's a nice improvement. Since it seems like
> > we
> > > >> have
> > > >> > > been using the voting thread for discussion, I'll do the same.
> > > >> > >
> > > >> > > A few minor comments/questions:
> > > >> > >
> > > >> > > 1. The proposed name for the time index file
> > > >> > `SegmentBaseOffset.timeindex`.
> > > >> > > Would `SegmentBaseOffset.time-index` be a little better? It
> would
> > > >> clearly
> > > >> > > separate the type of index in case we add additional index types
> > in
> > > >> the
> > > >> > > future.
> > > >> > >
> > > >> > I have no strong opinion on this, I am not adding any thing
> > separator
> > > >> > because it is more regex friendly.
> > > >> > I am not sure about the other indexes, time and space seems to be
> > two
> > > >> most
> > > >> > common dimensions.
> > > >> >
> > > >> > 2. When describing the time index entry, we say "Offset - the next
> > > >> offset
> > > >> > > when the time index entry is inserted". I found the mention of
> > > `next`
> > > >> a
> > > >> > bit
> > > >> > > confusing as it looks to me like the time index entry has the
> > first
> > > >> > offset
> > > >> > > in the message set.
> > > >> >
> > > >> > This semantic meaning is a little different from the offset index.
> > The
> > > >> > offset index information is self-contained by nature. i.e. all the
> > > >> offsets
> > > >> > before is smaller than the offset of this message set. So we only
> > need
> > > >> to
> > > >> > say "the offset of this message set is OFFSET". This does not
> quite
> > > >> apply
> > > >> > to the time index because the max timestamp may or may not be in
> the
> > > >> > message set being appended. So we have to either say, "the max
> > > timestamp
> > > >> > before I append this message set is T", or "the max timestamp
> after
> > I
> > > >> > appended this message set is T". The former case means that we can
> > > skip
> > > >> all
> > > >> > the previous messages if we are looking for a timestamp > T and
> > start
> > > >> from
> > > >> > this offset. The latter one means if we are searching for
> timestamp
> > >
> > > >> T, we
> > > >> > should start after this message set, which is essentially the same
> > as
> > > >> the
> > > >> > former case but require an additional interpretation.
> > > >> >
> > > >> > 3. We say "The default initial / max size of the time index files
> is
> > > the
> > > >> > > same as the offset index files. (time index entry is 1.5x of the
> > > size
> > > >> of
> > > >> > > offset index entry, user should set the configuration
> > accordingly)".
> > > >> It
> > > >> > may
> > > >> > > be worth elaborating a little on what a user should do with
> > regards
> > > to
> > > >> > this
> > > >> > > configuration when upgrading (ie maybe under "Compatibility,
> > > >> Deprecation,
> > > >> > > and Migration Plan").
> > > >> > >
> > > >> > Makes sense.
> > > >> >
> > > >> >
> > > >> > > 4. In a previous vote thread, Jun said "The simplest thing is
> > > probably
> > > >> > > to change
> > > >> > > the default index size to 2MB to match the default log segment
> > size"
> > > >> and
> > > >> > > you seemed to agree. I couldn't find anything about this in the
> > KIP.
> > > >> Are
> > > >> > we
> > > >> > > still doing it?
> > > >> > >
> > > >> > Yes, we can still make the change for default settings. User might
> > > want
> > > >> to
> > > >> > set the index size a little larger if they have a customized size
> > but
> > > in
> > > >> > reality it should not cause problems other than rolling out a
> little
> > > >> more
> > > >> > log segments.
> > > >> >
> > > >> > 5. We say "Instead, it is only monotonically increasing within
> each
> > > time
> > > >> > > index file. i.e. It is possible that the time index file for a
> > later
> > > >> log
> > > >> > > segment contains smaller timestamp than some timestamp in the
> time
> > > >> index
> > > >> > > file of an earlier segment.". I think it would be good to
> explain
> > > >> under
> > > >> > > which scenario a time index file for a later log segment
> contains
> > a
> > > >> > smaller
> > > >> > > timestamp (is this only when CreateTime is used?).
> > > >> > >
> > > >> > Yes, it only happens when CreateTime is used.
> > > >> >
> > > >> >
> > > >> > > 6. We say "When searching by timestamp, broker will start from
> the
> > > >> > earliest
> > > >> > > log segment and check the last time index entry.". The existing
> > > logic
> > > >> > > searches from newest segment backwards. Is there a reason why we
> > are
> > > >> > > changing it?
> > > >> > >
> > > >> > Suppose segment 0 has max timestamp 100, segment 1 has max
> timestamp
> > > 50
> > > >> and
> > > >> > segment 3 has max timestamp 90, now user want to search for
> > timestamp
> > > >> 80.
> > > >> > If we search backwards, we have to take a look at all the
> segments.
> > If
> > > >> we
> > > >> > search forward, we will stop at the first segment whose max
> > timestamp
> > > is
> > > >> > greater than 80 (i.e all the previous segments has smaller
> > timestamps)
> > > >> and
> > > >> > start the finer search on that segment.
> > > >> >
> > > >> >
> > > >> > > 7. Do you mind if I fix typos and minor grammar issues directly
> in
> > > the
> > > >> > > wiki? It seems easier than doing that via email.
> > > >> > >
> > > >> > Not at all, thanks for help.
> > > >> >
> > > >> >
> > > >> > > Thanks,
> > > >> > > Ismael
> > > >> > >
> > > >> > > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <
> becket.qin@gmail.com>
> > > >> wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > I updated KIP-33 based on the initial implementation. Per
> > > >> discussion on
> > > >> > > > yesterday's KIP hangout, I would like to initiate the new vote
> > > >> thread
> > > >> > for
> > > >> > > > KIP-33.
> > > >> > > >
> > > >> > > > The KIP wiki:
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 33+-+Add+a+time+based+log+index
> > > >> > > >
> > > >> > > > Here is a brief summary of the KIP:
> > > >> > > > 1. We propose to add a time index for each log segment.
> > > >> > > > 2. The time indices are going to be used of log retention, log
> > > >> rolling
> > > >> > > and
> > > >> > > > message search by timestamp.
> > > >> > > >
> > > >> > > > There was an old voting thread which has some discussions on
> > this
> > > >> KIP.
> > > >> > > The
> > > >> > > > mail thread link is following:
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%
> > 3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> > > >> > > >
> > > >> > > > I have the following WIP patch for reference. It needs a few
> > more
> > > >> unit
> > > >> > > > tests and documentation. Other than that it should run fine.
> > > >> > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > https://github.com/becketqin/kafka/commit/
> 712357a3fbf1423e05f9eed7d2fed5
> > b6fe6c37b7
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > >
> > > >> > > > Jiangjie (Becket) Qin
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Becket Qin <be...@gmail.com>.
Hi folks,

Here is another update on the change of time based log rolling.

After the latest implementation, we encountered KAFKA-4099. The issue is
that when users move replicas, the new replica may create one log segment
for each message in the old segments. The root cause is that we are
comparing the wall clock time with the message timestamp. A solution is
also described in KAFKA-4099, which is to base the log rolling purely on
the timestamps in the messages. More specifically, we roll the log segment
if the timestamp of the current message is greater than the timestamp of
the first message in the segment by more than log.roll.ms. This approach
is wall clock independent and should solve the problem. Combined with the
message.timestamp.difference.max.ms configuration, we can ensure that 1)
the log segment will be rolled within a bounded time, and 2) no
excessively large timestamp will be accepted and cause frequent log
rolling.
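
To make the proposed condition concrete, here is a minimal sketch of the
time based rolling check (hypothetical names, not the actual patch; the
real code also has to handle size based and index-full rolling):

    // Sketch only: a segment is rolled when the incoming message's timestamp
    // is more than log.roll.ms ahead of the first timestamped message that
    // was appended to the segment.
    def shouldRollOnTime(firstTimestampInSegment: Option[Long],
                         incomingTimestamp: Long,
                         rollMs: Long): Boolean =
      firstTimestampInSegment.exists(first => incomingTimestamp - first > rollMs)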

Any concern regarding this change?

Thanks,

Jiangjie (Becket) Qin

On Mon, Jun 13, 2016 at 2:30 PM, Guozhang Wang <wa...@gmail.com> wrote:

> Thanks Jiangjie,
>
> I see the need for sensitive data purging, the above proposed change LGTM.
> One minor concern is that a wrongly marked timestamp on the first record
> could cause the segment to roll much later / earlier, though it may be
> rare.
>
> Guozhang
>
> On Fri, Jun 10, 2016 at 10:07 AM, Becket Qin <be...@gmail.com> wrote:
>
> > Hi,
> >
> > During the implementation of KIP-33, we found it might be useful to have
> a
> > more deterministic time based log rolling than what proposed in the KIP.
> >
> > The current KIP proposal uses the largest timestamp in the segment for
> time
> > based rolling. The active log segment only rolls when there is no message
> > appended in max.roll.ms since the largest timestamp in the segment. i.e.
> > the rolling time may change if user keeping appending messages into the
> > segment. This may not be a desirable behavior for people who have
> sensitive
> > data and want to make sure they are removed after some time.
> >
> > To solve the above issue, we want to modify the KIP proposal regarding
> the
> > time based rolling to the following behavior. The time based log rolling
> > will be based on the first message with a timestamp in the log segment if
> > there is such a message. If no message in the segment has a timestamp,
> the
> > time based log rolling will still be based on log segment create time,
> > which is the same as we are doing now. The reasons we don't want to
> always
> > roll based on file create time are because 1) the message timestamp may
> be
> > assigned by clients which can be different from the create time of the
> log
> > segment file. 2) On some Linux, the file create time is not available, so
> > using segment file create time may not always work.
> >
> > Do people have any concern for this change? I will update the KIP if
> people
> > think the change is OK.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Apr 19, 2016 at 6:27 PM, Becket Qin <be...@gmail.com>
> wrote:
> >
> > > Thanks Joel and Ismael. I just updated the KIP based on your feedback.
> > >
> > > KIP-33 has passed with +4 (binding) and +2 (non-binding)
> > >
> > > Thanks everyone for the reading, feedback and voting!
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Tue, Apr 19, 2016 at 5:25 PM, Ismael Juma <is...@juma.me.uk>
> wrote:
> > >
> > >> Thanks Becket. I think it would be nice to update the KIP with regards
> > to
> > >> point 3 and 4.
> > >>
> > >> In any case, +1 (non-binding)
> > >>
> > >> Ismael
> > >>
> > >> On Tue, Apr 19, 2016 at 2:03 AM, Becket Qin <be...@gmail.com>
> > wrote:
> > >>
> > >> > Thanks for the comments Ismael. Please see the replies inline.
> > >> >
> > >> > On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma <is...@juma.me.uk>
> > wrote:
> > >> >
> > >> > > Hi Jiangjie,
> > >> > >
> > >> > > Thanks for the KIP, it's a nice improvement. Since it seems like
> we
> > >> have
> > >> > > been using the voting thread for discussion, I'll do the same.
> > >> > >
> > >> > > A few minor comments/questions:
> > >> > >
> > >> > > 1. The proposed name for the time index file
> > >> > `SegmentBaseOffset.timeindex`.
> > >> > > Would `SegmentBaseOffset.time-index` be a little better? It would
> > >> clearly
> > >> > > separate the type of index in case we add additional index types
> in
> > >> the
> > >> > > future.
> > >> > >
> > >> > I have no strong opinion on this, I am not adding any thing
> separator
> > >> > because it is more regex friendly.
> > >> > I am not sure about the other indexes, time and space seems to be
> two
> > >> most
> > >> > common dimensions.
> > >> >
> > >> > 2. When describing the time index entry, we say "Offset - the next
> > >> offset
> > >> > > when the time index entry is inserted". I found the mention of
> > `next`
> > >> a
> > >> > bit
> > >> > > confusing as it looks to me like the time index entry has the
> first
> > >> > offset
> > >> > > in the message set.
> > >> >
> > >> > This semantic meaning is a little different from the offset index.
> The
> > >> > offset index information is self-contained by nature. i.e. all the
> > >> offsets
> > >> > before is smaller than the offset of this message set. So we only
> need
> > >> to
> > >> > say "the offset of this message set is OFFSET". This does not quite
> > >> apply
> > >> > to the time index because the max timestamp may or may not be in the
> > >> > message set being appended. So we have to either say, "the max
> > timestamp
> > >> > before I append this message set is T", or "the max timestamp after
> I
> > >> > appended this message set is T". The former case means that we can
> > skip
> > >> all
> > >> > the previous messages if we are looking for a timestamp > T and
> start
> > >> from
> > >> > this offset. The latter one means if we are searching for timestamp
> >
> > >> T, we
> > >> > should start after this message set, which is essentially the same
> as
> > >> the
> > >> > former case but require an additional interpretation.
> > >> >
> > >> > 3. We say "The default initial / max size of the time index files is
> > the
> > >> > > same as the offset index files. (time index entry is 1.5x of the
> > size
> > >> of
> > >> > > offset index entry, user should set the configuration
> accordingly)".
> > >> It
> > >> > may
> > >> > > be worth elaborating a little on what a user should do with
> regards
> > to
> > >> > this
> > >> > > configuration when upgrading (ie maybe under "Compatibility,
> > >> Deprecation,
> > >> > > and Migration Plan").
> > >> > >
> > >> > Makes sense.
> > >> >
> > >> >
> > >> > > 4. In a previous vote thread, Jun said "The simplest thing is
> > probably
> > >> > > to change
> > >> > > the default index size to 2MB to match the default log segment
> size"
> > >> and
> > >> > > you seemed to agree. I couldn't find anything about this in the
> KIP.
> > >> Are
> > >> > we
> > >> > > still doing it?
> > >> > >
> > >> > Yes, we can still make the change for default settings. User might
> > want
> > >> to
> > >> > set the index size a little larger if they have a customized size
> but
> > in
> > >> > reality it should not cause problems other than rolling out a little
> > >> more
> > >> > log segments.
> > >> >
> > >> > 5. We say "Instead, it is only monotonically increasing within each
> > time
> > >> > > index file. i.e. It is possible that the time index file for a
> later
> > >> log
> > >> > > segment contains smaller timestamp than some timestamp in the time
> > >> index
> > >> > > file of an earlier segment.". I think it would be good to explain
> > >> under
> > >> > > which scenario a time index file for a later log segment contains
> a
> > >> > smaller
> > >> > > timestamp (is this only when CreateTime is used?).
> > >> > >
> > >> > Yes, it only happens when CreateTime is used.
> > >> >
> > >> >
> > >> > > 6. We say "When searching by timestamp, broker will start from the
> > >> > earliest
> > >> > > log segment and check the last time index entry.". The existing
> > logic
> > >> > > searches from newest segment backwards. Is there a reason why we
> are
> > >> > > changing it?
> > >> > >
> > >> > Suppose segment 0 has max timestamp 100, segment 1 has max timestamp
> > 50
> > >> and
> > >> > segment 3 has max timestamp 90, now user want to search for
> timestamp
> > >> 80.
> > >> > If we search backwards, we have to take a look at all the segments.
> If
> > >> we
> > >> > search forward, we will stop at the first segment whose max
> timestamp
> > is
> > >> > greater than 80 (i.e all the previous segments has smaller
> timestamps)
> > >> and
> > >> > start the finer search on that segment.
> > >> >
> > >> >
> > >> > > 7. Do you mind if I fix typos and minor grammar issues directly in
> > the
> > >> > > wiki? It seems easier than doing that via email.
> > >> > >
> > >> > Not at all, thanks for help.
> > >> >
> > >> >
> > >> > > Thanks,
> > >> > > Ismael
> > >> > >
> > >> > > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <be...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > I updated KIP-33 based on the initial implementation. Per
> > >> discussion on
> > >> > > > yesterday's KIP hangout, I would like to initiate the new vote
> > >> thread
> > >> > for
> > >> > > > KIP-33.
> > >> > > >
> > >> > > > The KIP wiki:
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 33+-+Add+a+time+based+log+index
> > >> > > >
> > >> > > > Here is a brief summary of the KIP:
> > >> > > > 1. We propose to add a time index for each log segment.
> > >> > > > 2. The time indices are going to be used of log retention, log
> > >> rolling
> > >> > > and
> > >> > > > message search by timestamp.
> > >> > > >
> > >> > > > There was an old voting thread which has some discussions on
> this
> > >> KIP.
> > >> > > The
> > >> > > > mail thread link is following:
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%
> 3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> > >> > > >
> > >> > > > I have the following WIP patch for reference. It needs a few
> more
> > >> unit
> > >> > > > tests and documentation. Other than that it should run fine.
> > >> > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5
> b6fe6c37b7
> > >> > > >
> > >> > > > Thanks,
> > >> > > >
> > >> > > > Jiangjie (Becket) Qin
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
>
>
> --
> -- Guozhang
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Guozhang Wang <wa...@gmail.com>.
Thanks Jiangjie,

I see the need for sensitive data purging; the above proposed change LGTM.
One minor concern is that a wrongly marked timestamp on the first record
could cause the segment to roll much later / earlier, though it may be rare.

Guozhang

On Fri, Jun 10, 2016 at 10:07 AM, Becket Qin <be...@gmail.com> wrote:

> Hi,
>
> During the implementation of KIP-33, we found it might be useful to have a
> more deterministic time based log rolling than what proposed in the KIP.
>
> The current KIP proposal uses the largest timestamp in the segment for time
> based rolling. The active log segment only rolls when there is no message
> appended in max.roll.ms since the largest timestamp in the segment. i.e.
> the rolling time may change if user keeping appending messages into the
> segment. This may not be a desirable behavior for people who have sensitive
> data and want to make sure they are removed after some time.
>
> To solve the above issue, we want to modify the KIP proposal regarding the
> time based rolling to the following behavior. The time based log rolling
> will be based on the first message with a timestamp in the log segment if
> there is such a message. If no message in the segment has a timestamp, the
> time based log rolling will still be based on log segment create time,
> which is the same as we are doing now. The reasons we don't want to always
> roll based on file create time are because 1) the message timestamp may be
> assigned by clients which can be different from the create time of the log
> segment file. 2) On some Linux, the file create time is not available, so
> using segment file create time may not always work.
>
> Do people have any concern for this change? I will update the KIP if people
> think the change is OK.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Tue, Apr 19, 2016 at 6:27 PM, Becket Qin <be...@gmail.com> wrote:
>
> > Thanks Joel and Ismael. I just updated the KIP based on your feedback.
> >
> > KIP-33 has passed with +4 (binding) and +2 (non-binding)
> >
> > Thanks everyone for the reading, feedback and voting!
> >
> > Jiangjie (Becket) Qin
> >
> > On Tue, Apr 19, 2016 at 5:25 PM, Ismael Juma <is...@juma.me.uk> wrote:
> >
> >> Thanks Becket. I think it would be nice to update the KIP with regards
> to
> >> point 3 and 4.
> >>
> >> In any case, +1 (non-binding)
> >>
> >> Ismael
> >>
> >> On Tue, Apr 19, 2016 at 2:03 AM, Becket Qin <be...@gmail.com>
> wrote:
> >>
> >> > Thanks for the comments Ismael. Please see the replies inline.
> >> >
> >> > On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma <is...@juma.me.uk>
> wrote:
> >> >
> >> > > Hi Jiangjie,
> >> > >
> >> > > Thanks for the KIP, it's a nice improvement. Since it seems like we
> >> have
> >> > > been using the voting thread for discussion, I'll do the same.
> >> > >
> >> > > A few minor comments/questions:
> >> > >
> >> > > 1. The proposed name for the time index file
> >> > `SegmentBaseOffset.timeindex`.
> >> > > Would `SegmentBaseOffset.time-index` be a little better? It would
> >> clearly
> >> > > separate the type of index in case we add additional index types in
> >> the
> >> > > future.
> >> > >
> >> > I have no strong opinion on this, I am not adding any thing separator
> >> > because it is more regex friendly.
> >> > I am not sure about the other indexes, time and space seems to be two
> >> most
> >> > common dimensions.
> >> >
> >> > 2. When describing the time index entry, we say "Offset - the next
> >> offset
> >> > > when the time index entry is inserted". I found the mention of
> `next`
> >> a
> >> > bit
> >> > > confusing as it looks to me like the time index entry has the first
> >> > offset
> >> > > in the message set.
> >> >
> >> > This semantic meaning is a little different from the offset index. The
> >> > offset index information is self-contained by nature. i.e. all the
> >> offsets
> >> > before is smaller than the offset of this message set. So we only need
> >> to
> >> > say "the offset of this message set is OFFSET". This does not quite
> >> apply
> >> > to the time index because the max timestamp may or may not be in the
> >> > message set being appended. So we have to either say, "the max
> timestamp
> >> > before I append this message set is T", or "the max timestamp after I
> >> > appended this message set is T". The former case means that we can
> skip
> >> all
> >> > the previous messages if we are looking for a timestamp > T and start
> >> from
> >> > this offset. The latter one means if we are searching for timestamp >
> >> T, we
> >> > should start after this message set, which is essentially the same as
> >> the
> >> > former case but require an additional interpretation.
> >> >
> >> > 3. We say "The default initial / max size of the time index files is
> the
> >> > > same as the offset index files. (time index entry is 1.5x of the
> size
> >> of
> >> > > offset index entry, user should set the configuration accordingly)".
> >> It
> >> > may
> >> > > be worth elaborating a little on what a user should do with regards
> to
> >> > this
> >> > > configuration when upgrading (ie maybe under "Compatibility,
> >> Deprecation,
> >> > > and Migration Plan").
> >> > >
> >> > Makes sense.
> >> >
> >> >
> >> > > 4. In a previous vote thread, Jun said "The simplest thing is
> probably
> >> > > to change
> >> > > the default index size to 2MB to match the default log segment size"
> >> and
> >> > > you seemed to agree. I couldn't find anything about this in the KIP.
> >> Are
> >> > we
> >> > > still doing it?
> >> > >
> >> > Yes, we can still make the change for default settings. User might
> want
> >> to
> >> > set the index size a little larger if they have a customized size but
> in
> >> > reality it should not cause problems other than rolling out a little
> >> more
> >> > log segments.
> >> >
> >> > 5. We say "Instead, it is only monotonically increasing within each
> time
> >> > > index file. i.e. It is possible that the time index file for a later
> >> log
> >> > > segment contains smaller timestamp than some timestamp in the time
> >> index
> >> > > file of an earlier segment.". I think it would be good to explain
> >> under
> >> > > which scenario a time index file for a later log segment contains a
> >> > smaller
> >> > > timestamp (is this only when CreateTime is used?).
> >> > >
> >> > Yes, it only happens when CreateTime is used.
> >> >
> >> >
> >> > > 6. We say "When searching by timestamp, broker will start from the
> >> > earliest
> >> > > log segment and check the last time index entry.". The existing
> logic
> >> > > searches from newest segment backwards. Is there a reason why we are
> >> > > changing it?
> >> > >
> >> > Suppose segment 0 has max timestamp 100, segment 1 has max timestamp
> 50
> >> and
> >> > segment 3 has max timestamp 90, now user want to search for timestamp
> >> 80.
> >> > If we search backwards, we have to take a look at all the segments. If
> >> we
> >> > search forward, we will stop at the first segment whose max timestamp
> is
> >> > greater than 80 (i.e all the previous segments has smaller timestamps)
> >> and
> >> > start the finer search on that segment.
> >> >
> >> >
> >> > > 7. Do you mind if I fix typos and minor grammar issues directly in
> the
> >> > > wiki? It seems easier than doing that via email.
> >> > >
> >> > Not at all, thanks for help.
> >> >
> >> >
> >> > > Thanks,
> >> > > Ismael
> >> > >
> >> > > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <be...@gmail.com>
> >> wrote:
> >> > >
> >> > > > Hi all,
> >> > > >
> >> > > > I updated KIP-33 based on the initial implementation. Per
> >> discussion on
> >> > > > yesterday's KIP hangout, I would like to initiate the new vote
> >> thread
> >> > for
> >> > > > KIP-33.
> >> > > >
> >> > > > The KIP wiki:
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> >> > > >
> >> > > > Here is a brief summary of the KIP:
> >> > > > 1. We propose to add a time index for each log segment.
> >> > > > 2. The time indices are going to be used of log retention, log
> >> rolling
> >> > > and
> >> > > > message search by timestamp.
> >> > > >
> >> > > > There was an old voting thread which has some discussions on this
> >> KIP.
> >> > > The
> >> > > > mail thread link is following:
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> >> > > >
> >> > > > I have the following WIP patch for reference. It needs a few more
> >> unit
> >> > > > tests and documentation. Other than that it should run fine.
> >> > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jiangjie (Becket) Qin
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>



-- 
-- Guozhang

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Becket Qin <be...@gmail.com>.
Hi,

During the implementation of KIP-33, we found it might be useful to have a
more deterministic time based log rolling than what is proposed in the KIP.

The current KIP proposal uses the largest timestamp in the segment for time
based rolling. The active log segment only rolls when no message has been
appended within log.roll.ms since the largest timestamp in the segment,
i.e. the rolling time may keep moving if users keep appending messages to
the segment. This may not be a desirable behavior for people who have
sensitive data and want to make sure it is removed after some time.

To solve the above issue, we want to modify the KIP proposal regarding
time based rolling to the following behavior. The time based log rolling
will be based on the first message with a timestamp in the log segment, if
there is such a message. If no message in the segment has a timestamp, the
time based log rolling will still be based on the log segment create time,
which is the same as what we are doing now. The reasons we don't want to
always roll based on file create time are: 1) the message timestamp may be
assigned by clients and can be different from the create time of the log
segment file, and 2) on some Linux filesystems the file create time is not
available, so using the segment file create time may not always work.
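
For illustration only, a minimal sketch of the proposed behavior
(hypothetical names, not the actual implementation):

    // Sketch only: pick the reference time for time based rolling. Use the
    // timestamp of the first timestamped message in the segment if there is
    // one; otherwise fall back to the segment file create time.
    def rollReferenceMs(firstMessageTimestamp: Option[Long],
                        segmentCreateTimeMs: Long): Long =
      firstMessageTimestamp.getOrElse(segmentCreateTimeMs)

    // The segment becomes eligible for rolling once the reference time is
    // older than log.roll.ms.
    def shouldRoll(nowMs: Long, referenceMs: Long, rollMs: Long): Boolean =
      nowMs - referenceMs > rollMs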

Do people have any concerns about this change? I will update the KIP if
people think the change is OK.

Thanks,

Jiangjie (Becket) Qin

On Tue, Apr 19, 2016 at 6:27 PM, Becket Qin <be...@gmail.com> wrote:

> Thanks Joel and Ismael. I just updated the KIP based on your feedback.
>
> KIP-33 has passed with +4 (binding) and +2 (non-binding)
>
> Thanks everyone for the reading, feedback and voting!
>
> Jiangjie (Becket) Qin
>
> On Tue, Apr 19, 2016 at 5:25 PM, Ismael Juma <is...@juma.me.uk> wrote:
>
>> Thanks Becket. I think it would be nice to update the KIP with regards to
>> point 3 and 4.
>>
>> In any case, +1 (non-binding)
>>
>> Ismael
>>
>> On Tue, Apr 19, 2016 at 2:03 AM, Becket Qin <be...@gmail.com> wrote:
>>
>> > Thanks for the comments Ismael. Please see the replies inline.
>> >
>> > On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma <is...@juma.me.uk> wrote:
>> >
>> > > Hi Jiangjie,
>> > >
>> > > Thanks for the KIP, it's a nice improvement. Since it seems like we
>> have
>> > > been using the voting thread for discussion, I'll do the same.
>> > >
>> > > A few minor comments/questions:
>> > >
>> > > 1. The proposed name for the time index file
>> > `SegmentBaseOffset.timeindex`.
>> > > Would `SegmentBaseOffset.time-index` be a little better? It would
>> clearly
>> > > separate the type of index in case we add additional index types in
>> the
>> > > future.
>> > >
>> > I have no strong opinion on this, I am not adding any thing separator
>> > because it is more regex friendly.
>> > I am not sure about the other indexes, time and space seems to be two
>> most
>> > common dimensions.
>> >
>> > 2. When describing the time index entry, we say "Offset - the next
>> offset
>> > > when the time index entry is inserted". I found the mention of `next`
>> a
>> > bit
>> > > confusing as it looks to me like the time index entry has the first
>> > offset
>> > > in the message set.
>> >
>> > This semantic meaning is a little different from the offset index. The
>> > offset index information is self-contained by nature. i.e. all the
>> offsets
>> > before is smaller than the offset of this message set. So we only need
>> to
>> > say "the offset of this message set is OFFSET". This does not quite
>> apply
>> > to the time index because the max timestamp may or may not be in the
>> > message set being appended. So we have to either say, "the max timestamp
>> > before I append this message set is T", or "the max timestamp after I
>> > appended this message set is T". The former case means that we can skip
>> all
>> > the previous messages if we are looking for a timestamp > T and start
>> from
>> > this offset. The latter one means if we are searching for timestamp >
>> T, we
>> > should start after this message set, which is essentially the same as
>> the
>> > former case but require an additional interpretation.
>> >
>> > 3. We say "The default initial / max size of the time index files is the
>> > > same as the offset index files. (time index entry is 1.5x of the size
>> of
>> > > offset index entry, user should set the configuration accordingly)".
>> It
>> > may
>> > > be worth elaborating a little on what a user should do with regards to
>> > this
>> > > configuration when upgrading (ie maybe under "Compatibility,
>> Deprecation,
>> > > and Migration Plan").
>> > >
>> > Makes sense.
>> >
>> >
>> > > 4. In a previous vote thread, Jun said "The simplest thing is probably
>> > > to change
>> > > the default index size to 2MB to match the default log segment size"
>> and
>> > > you seemed to agree. I couldn't find anything about this in the KIP.
>> Are
>> > we
>> > > still doing it?
>> > >
>> > Yes, we can still make the change for default settings. User might want
>> to
>> > set the index size a little larger if they have a customized size but in
>> > reality it should not cause problems other than rolling out a little
>> more
>> > log segments.
>> >
>> > 5. We say "Instead, it is only monotonically increasing within each time
>> > > index file. i.e. It is possible that the time index file for a later
>> log
>> > > segment contains smaller timestamp than some timestamp in the time
>> index
>> > > file of an earlier segment.". I think it would be good to explain
>> under
>> > > which scenario a time index file for a later log segment contains a
>> > smaller
>> > > timestamp (is this only when CreateTime is used?).
>> > >
>> > Yes, it only happens when CreateTime is used.
>> >
>> >
>> > > 6. We say "When searching by timestamp, broker will start from the
>> > earliest
>> > > log segment and check the last time index entry.". The existing logic
>> > > searches from newest segment backwards. Is there a reason why we are
>> > > changing it?
>> > >
>> > Suppose segment 0 has max timestamp 100, segment 1 has max timestamp 50
>> and
>> > segment 3 has max timestamp 90, now user want to search for timestamp
>> 80.
>> > If we search backwards, we have to take a look at all the segments. If
>> we
>> > search forward, we will stop at the first segment whose max timestamp is
>> > greater than 80 (i.e all the previous segments has smaller timestamps)
>> and
>> > start the finer search on that segment.
>> >
>> >
>> > > 7. Do you mind if I fix typos and minor grammar issues directly in the
>> > > wiki? It seems easier than doing that via email.
>> > >
>> > Not at all, thanks for help.
>> >
>> >
>> > > Thanks,
>> > > Ismael
>> > >
>> > > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <be...@gmail.com>
>> wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > I updated KIP-33 based on the initial implementation. Per
>> discussion on
>> > > > yesterday's KIP hangout, I would like to initiate the new vote
>> thread
>> > for
>> > > > KIP-33.
>> > > >
>> > > > The KIP wiki:
>> > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
>> > > >
>> > > > Here is a brief summary of the KIP:
>> > > > 1. We propose to add a time index for each log segment.
>> > > > 2. The time indices are going to be used of log retention, log
>> rolling
>> > > and
>> > > > message search by timestamp.
>> > > >
>> > > > There was an old voting thread which has some discussions on this
>> KIP.
>> > > The
>> > > > mail thread link is following:
>> > > >
>> > > >
>> > >
>> >
>> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
>> > > >
>> > > > I have the following WIP patch for reference. It needs a few more
>> unit
>> > > > tests and documentation. Other than that it should run fine.
>> > > >
>> > > >
>> > >
>> >
>> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jiangjie (Becket) Qin
>> > > >
>> > >
>> >
>>
>
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Becket Qin <be...@gmail.com>.
Thanks Joel and Ismael. I just updated the KIP based on your feedback.

KIP-33 has passed with +4 (binding) and +2 (non-binding).

Thanks everyone for reading, and for the feedback and voting!

Jiangjie (Becket) Qin

On Tue, Apr 19, 2016 at 5:25 PM, Ismael Juma <is...@juma.me.uk> wrote:

> Thanks Becket. I think it would be nice to update the KIP with regards to
> point 3 and 4.
>
> In any case, +1 (non-binding)
>
> Ismael
>
> On Tue, Apr 19, 2016 at 2:03 AM, Becket Qin <be...@gmail.com> wrote:
>
> > Thanks for the comments Ismael. Please see the replies inline.
> >
> > On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma <is...@juma.me.uk> wrote:
> >
> > > Hi Jiangjie,
> > >
> > > Thanks for the KIP, it's a nice improvement. Since it seems like we
> have
> > > been using the voting thread for discussion, I'll do the same.
> > >
> > > A few minor comments/questions:
> > >
> > > 1. The proposed name for the time index file
> > `SegmentBaseOffset.timeindex`.
> > > Would `SegmentBaseOffset.time-index` be a little better? It would
> clearly
> > > separate the type of index in case we add additional index types in the
> > > future.
> > >
> > I have no strong opinion on this, I am not adding any thing separator
> > because it is more regex friendly.
> > I am not sure about the other indexes, time and space seems to be two
> most
> > common dimensions.
> >
> > 2. When describing the time index entry, we say "Offset - the next offset
> > > when the time index entry is inserted". I found the mention of `next` a
> > bit
> > > confusing as it looks to me like the time index entry has the first
> > offset
> > > in the message set.
> >
> > This semantic meaning is a little different from the offset index. The
> > offset index information is self-contained by nature. i.e. all the
> offsets
> > before is smaller than the offset of this message set. So we only need to
> > say "the offset of this message set is OFFSET". This does not quite apply
> > to the time index because the max timestamp may or may not be in the
> > message set being appended. So we have to either say, "the max timestamp
> > before I append this message set is T", or "the max timestamp after I
> > appended this message set is T". The former case means that we can skip
> all
> > the previous messages if we are looking for a timestamp > T and start
> from
> > this offset. The latter one means if we are searching for timestamp > T,
> we
> > should start after this message set, which is essentially the same as the
> > former case but require an additional interpretation.
> >
> > 3. We say "The default initial / max size of the time index files is the
> > > same as the offset index files. (time index entry is 1.5x of the size
> of
> > > offset index entry, user should set the configuration accordingly)". It
> > may
> > > be worth elaborating a little on what a user should do with regards to
> > this
> > > configuration when upgrading (ie maybe under "Compatibility,
> Deprecation,
> > > and Migration Plan").
> > >
> > Makes sense.
> >
> >
> > > 4. In a previous vote thread, Jun said "The simplest thing is probably
> > > to change
> > > the default index size to 2MB to match the default log segment size"
> and
> > > you seemed to agree. I couldn't find anything about this in the KIP.
> Are
> > we
> > > still doing it?
> > >
> > Yes, we can still make the change for default settings. User might want
> to
> > set the index size a little larger if they have a customized size but in
> > reality it should not cause problems other than rolling out a little more
> > log segments.
> >
> > 5. We say "Instead, it is only monotonically increasing within each time
> > > index file. i.e. It is possible that the time index file for a later
> log
> > > segment contains smaller timestamp than some timestamp in the time
> index
> > > file of an earlier segment.". I think it would be good to explain under
> > > which scenario a time index file for a later log segment contains a
> > smaller
> > > timestamp (is this only when CreateTime is used?).
> > >
> > Yes, it only happens when CreateTime is used.
> >
> >
> > > 6. We say "When searching by timestamp, broker will start from the
> > earliest
> > > log segment and check the last time index entry.". The existing logic
> > > searches from newest segment backwards. Is there a reason why we are
> > > changing it?
> > >
> > Suppose segment 0 has max timestamp 100, segment 1 has max timestamp 50
> and
> > segment 3 has max timestamp 90, now user want to search for timestamp 80.
> > If we search backwards, we have to take a look at all the segments. If we
> > search forward, we will stop at the first segment whose max timestamp is
> > greater than 80 (i.e all the previous segments has smaller timestamps)
> and
> > start the finer search on that segment.
> >
> >
> > > 7. Do you mind if I fix typos and minor grammar issues directly in the
> > > wiki? It seems easier than doing that via email.
> > >
> > Not at all, thanks for help.
> >
> >
> > > Thanks,
> > > Ismael
> > >
> > > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <be...@gmail.com>
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I updated KIP-33 based on the initial implementation. Per discussion
> on
> > > > yesterday's KIP hangout, I would like to initiate the new vote thread
> > for
> > > > KIP-33.
> > > >
> > > > The KIP wiki:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> > > >
> > > > Here is a brief summary of the KIP:
> > > > 1. We propose to add a time index for each log segment.
> > > > 2. The time indices are going to be used of log retention, log
> rolling
> > > and
> > > > message search by timestamp.
> > > >
> > > > There was an old voting thread which has some discussions on this
> KIP.
> > > The
> > > > mail thread link is following:
> > > >
> > > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> > > >
> > > > I have the following WIP patch for reference. It needs a few more
> unit
> > > > tests and documentation. Other than that it should run fine.
> > > >
> > > >
> > >
> >
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > >
> >
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Ismael Juma <is...@juma.me.uk>.
Thanks Becket. I think it would be nice to update the KIP with regards to
point 3 and 4.

In any case, +1 (non-binding)

Ismael

On Tue, Apr 19, 2016 at 2:03 AM, Becket Qin <be...@gmail.com> wrote:

> Thanks for the comments Ismael. Please see the replies inline.
>
> On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma <is...@juma.me.uk> wrote:
>
> > Hi Jiangjie,
> >
> > Thanks for the KIP, it's a nice improvement. Since it seems like we have
> > been using the voting thread for discussion, I'll do the same.
> >
> > A few minor comments/questions:
> >
> > 1. The proposed name for the time index file
> `SegmentBaseOffset.timeindex`.
> > Would `SegmentBaseOffset.time-index` be a little better? It would clearly
> > separate the type of index in case we add additional index types in the
> > future.
> >
> I have no strong opinion on this, I am not adding any thing separator
> because it is more regex friendly.
> I am not sure about the other indexes, time and space seems to be two most
> common dimensions.
>
> 2. When describing the time index entry, we say "Offset - the next offset
> > when the time index entry is inserted". I found the mention of `next` a
> bit
> > confusing as it looks to me like the time index entry has the first
> offset
> > in the message set.
>
> This semantic meaning is a little different from the offset index. The
> offset index information is self-contained by nature. i.e. all the offsets
> before is smaller than the offset of this message set. So we only need to
> say "the offset of this message set is OFFSET". This does not quite apply
> to the time index because the max timestamp may or may not be in the
> message set being appended. So we have to either say, "the max timestamp
> before I append this message set is T", or "the max timestamp after I
> appended this message set is T". The former case means that we can skip all
> the previous messages if we are looking for a timestamp > T and start from
> this offset. The latter one means if we are searching for timestamp > T, we
> should start after this message set, which is essentially the same as the
> former case but require an additional interpretation.
>
> 3. We say "The default initial / max size of the time index files is the
> > same as the offset index files. (time index entry is 1.5x of the size of
> > offset index entry, user should set the configuration accordingly)". It
> may
> > be worth elaborating a little on what a user should do with regards to
> this
> > configuration when upgrading (ie maybe under "Compatibility, Deprecation,
> > and Migration Plan").
> >
> Makes sense.
>
>
> > 4. In a previous vote thread, Jun said "The simplest thing is probably
> > to change
> > the default index size to 2MB to match the default log segment size" and
> > you seemed to agree. I couldn't find anything about this in the KIP. Are
> we
> > still doing it?
> >
> Yes, we can still make the change for default settings. User might want to
> set the index size a little larger if they have a customized size but in
> reality it should not cause problems other than rolling out a little more
> log segments.
>
> 5. We say "Instead, it is only monotonically increasing within each time
> > index file. i.e. It is possible that the time index file for a later log
> > segment contains smaller timestamp than some timestamp in the time index
> > file of an earlier segment.". I think it would be good to explain under
> > which scenario a time index file for a later log segment contains a
> smaller
> > timestamp (is this only when CreateTime is used?).
> >
> Yes, it only happens when CreateTime is used.
>
>
> > 6. We say "When searching by timestamp, broker will start from the
> earliest
> > log segment and check the last time index entry.". The existing logic
> > searches from newest segment backwards. Is there a reason why we are
> > changing it?
> >
> Suppose segment 0 has max timestamp 100, segment 1 has max timestamp 50 and
> segment 3 has max timestamp 90, now user want to search for timestamp 80.
> If we search backwards, we have to take a look at all the segments. If we
> search forward, we will stop at the first segment whose max timestamp is
> greater than 80 (i.e all the previous segments has smaller timestamps) and
> start the finer search on that segment.
>
>
> > 7. Do you mind if I fix typos and minor grammar issues directly in the
> > wiki? It seems easier than doing that via email.
> >
> Not at all, thanks for help.
>
>
> > Thanks,
> > Ismael
> >
> > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <be...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I updated KIP-33 based on the initial implementation. Per discussion on
> > > yesterday's KIP hangout, I would like to initiate the new vote thread
> for
> > > KIP-33.
> > >
> > > The KIP wiki:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> > >
> > > Here is a brief summary of the KIP:
> > > 1. We propose to add a time index for each log segment.
> > > 2. The time indices are going to be used of log retention, log rolling
> > and
> > > message search by timestamp.
> > >
> > > There was an old voting thread which has some discussions on this KIP.
> > The
> > > mail thread link is following:
> > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> > >
> > > I have the following WIP patch for reference. It needs a few more unit
> > > tests and documentation. Other than that it should run fine.
> > >
> > >
> >
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> >
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Becket Qin <be...@gmail.com>.
Thanks for the comments Ismael. Please see the replies inline.

On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma <is...@juma.me.uk> wrote:

> Hi Jiangjie,
>
> Thanks for the KIP, it's a nice improvement. Since it seems like we have
> been using the voting thread for discussion, I'll do the same.
>
> A few minor comments/questions:
>
> 1. The proposed name for the time index file `SegmentBaseOffset.timeindex`.
> Would `SegmentBaseOffset.time-index` be a little better? It would clearly
> separate the type of index in case we add additional index types in the
> future.
>
I have no strong opinion on this. I did not add a separator because the
current name is more regex friendly.
I am not sure about other index types; time and space seem to be the two
most common dimensions.

> 2. When describing the time index entry, we say "Offset - the next offset
> when the time index entry is inserted". I found the mention of `next` a bit
> confusing as it looks to me like the time index entry has the first offset
> in the message set.

The semantics here are a little different from the offset index. The
offset index information is self-contained by nature, i.e. all the offsets
before this entry are smaller than the offset of this message set, so we
only need to say "the offset of this message set is OFFSET". This does not
quite apply to the time index because the max timestamp may or may not be
in the message set being appended. So we have to either say "the max
timestamp before I append this message set is T", or "the max timestamp
after I appended this message set is T". The former means that we can skip
all the previous messages if we are looking for a timestamp > T and start
from this offset. The latter means that if we are searching for a
timestamp > T, we should start after this message set, which is
essentially the same as the former case but requires an additional
interpretation.
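
To illustrate the chosen semantics with a small sketch (hypothetical types
and names, for illustration only), an entry recording "the max timestamp
before this append is T" together with the append offset lets a search for
a timestamp greater than T skip everything before that offset:

    // Sketch only: each entry records the max timestamp seen before the
    // appended message set, plus the offset to resume scanning from.
    case class TimeIndexEntry(maxTimestampSoFar: Long, offset: Long)

    // Within one segment the recorded timestamps are non-decreasing, so the
    // last entry whose timestamp is still below the target gives the offset
    // to start the finer scan from.
    def startOffsetFor(entries: Seq[TimeIndexEntry], target: Long,
                       segmentBaseOffset: Long): Long =
      entries.takeWhile(_.maxTimestampSoFar < target)
        .lastOption.map(_.offset).getOrElse(segmentBaseOffset)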

3. We say "The default initial / max size of the time index files is the
> same as the offset index files. (time index entry is 1.5x of the size of
> offset index entry, user should set the configuration accordingly)". It may
> be worth elaborating a little on what a user should do with regards to this
> configuration when upgrading (ie maybe under "Compatibility, Deprecation,
> and Migration Plan").
>
Makes sense.


> 4. In a previous vote thread, Jun said "The simplest thing is probably
> to change
> the default index size to 2MB to match the default log segment size" and
> you seemed to agree. I couldn't find anything about this in the KIP. Are we
> still doing it?
>
Yes, we can still make that change to the default settings. Users might
want to set the index size a little larger if they have a customized
segment size, but in reality it should not cause problems other than
rolling a few more log segments.

5. We say "Instead, it is only monotonically increasing within each time
> index file. i.e. It is possible that the time index file for a later log
> segment contains smaller timestamp than some timestamp in the time index
> file of an earlier segment.". I think it would be good to explain under
> which scenario a time index file for a later log segment contains a smaller
> timestamp (is this only when CreateTime is used?).
>
Yes, it only happens when CreateTime is used.


> 6. We say "When searching by timestamp, broker will start from the earliest
> log segment and check the last time index entry.". The existing logic
> searches from newest segment backwards. Is there a reason why we are
> changing it?
>
Suppose segment 0 has max timestamp 100, segment 1 has max timestamp 50 and
segment 2 has max timestamp 90, and a user wants to search for timestamp 80.
If we search backwards, we have to take a look at all the segments. If we
search forward, we stop at the first segment whose max timestamp is
greater than 80 (i.e. all the previous segments have smaller timestamps) and
start the finer search on that segment.
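
A rough sketch of that forward search (hypothetical names, not the actual
broker code):

    // Sketch only: walk the segments from oldest to newest and stop at the
    // first one whose max timestamp reaches the target; all earlier segments
    // are guaranteed to contain only smaller timestamps and can be skipped.
    case class SegmentInfo(baseOffset: Long, maxTimestamp: Long)

    def segmentToSearch(segments: Seq[SegmentInfo], target: Long): Option[SegmentInfo] =
      segments.find(_.maxTimestamp >= target)

    // With max timestamps 100, 50, 90 and target 80, the scan stops at the
    // first segment (max timestamp 100) and the finer search happens there.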


> 7. Do you mind if I fix typos and minor grammar issues directly in the
> wiki? It seems easier than doing that via email.
>
Not at all, thanks for the help.


> Thanks,
> Ismael
>
> On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <be...@gmail.com> wrote:
>
> > Hi all,
> >
> > I updated KIP-33 based on the initial implementation. Per discussion on
> > yesterday's KIP hangout, I would like to initiate the new vote thread for
> > KIP-33.
> >
> > The KIP wiki:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> >
> > Here is a brief summary of the KIP:
> > 1. We propose to add a time index for each log segment.
> > 2. The time indices are going to be used of log retention, log rolling
> and
> > message search by timestamp.
> >
> > There was an old voting thread which has some discussions on this KIP.
> The
> > mail thread link is following:
> >
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> >
> > I have the following WIP patch for reference. It needs a few more unit
> > tests and documentation. Other than that it should run fine.
> >
> >
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Ismael Juma <is...@juma.me.uk>.
Hi Jiangjie,

Thanks for the KIP, it's a nice improvement. Since it seems like we have
been using the voting thread for discussion, I'll do the same.

A few minor comments/questions:

1. The proposed name for the time index file is `SegmentBaseOffset.timeindex`.
Would `SegmentBaseOffset.time-index` be a little better? It would clearly
separate the type of index in case we add additional index types in the
future.
2. When describing the time index entry, we say "Offset - the next offset
when the time index entry is inserted". I found the mention of `next` a bit
confusing as it looks to me like the time index entry has the first offset
in the message set.
3. We say "The default initial / max size of the time index files is the
same as the offset index files. (time index entry is 1.5x of the size of
offset index entry, user should set the configuration accordingly)". It may
be worth elaborating a little on what a user should do with regards to this
configuration when upgrading (ie maybe under "Compatibility, Deprecation,
and Migration Plan").
4. In a previous vote thread, Jun said "The simplest thing is probably
to change
the default index size to 2MB to match the default log segment size" and
you seemed to agree. I couldn't find anything about this in the KIP. Are we
still doing it?
5. We say "Instead, it is only monotonically increasing within each time
index file. i.e. It is possible that the time index file for a later log
segment contains smaller timestamp than some timestamp in the time index
file of an earlier segment.". I think it would be good to explain under
which scenario a time index file for a later log segment contains a smaller
timestamp (is this only when CreateTime is used?).
6. We say "When searching by timestamp, broker will start from the earliest
log segment and check the last time index entry.". The existing logic
searches from newest segment backwards. Is there a reason why we are
changing it?
7. Do you mind if I fix typos and minor grammar issues directly in the
wiki? It seems easier than doing that via email.

Thanks,
Ismael

On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <be...@gmail.com> wrote:

> Hi all,
>
> I updated KIP-33 based on the initial implementation. Per discussion on
> yesterday's KIP hangout, I would like to initiate the new vote thread for
> KIP-33.
>
> The KIP wiki:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
>
> Here is a brief summary of the KIP:
> 1. We propose to add a time index for each log segment.
> 2. The time indices are going to be used of log retention, log rolling and
> message search by timestamp.
>
> There was an old voting thread which has some discussions on this KIP. The
> mail thread link is following:
>
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
>
> I have the following WIP patch for reference. It needs a few more unit
> tests and documentation. Other than that it should run fine.
>
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
>
> Thanks,
>
> Jiangjie (Becket) Qin
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Guozhang Wang <wa...@gmail.com>.
Great job Jiangjie.

A few comments:

1. "if an offset index entry is inserted, it will also insert a time index
entry" what is the motivation for co-inserting offset index and timestamp
index? Is it just for simplicity or are there any other considerations?

2. "Search message by timestamp": you may want to describe the behavior
from the handling of the offset request as well, i.e. giving the timestamp
returns an offset where messages before this offset are guaranteed to have
smaller timestamps.
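
For example, that guarantee could be stated as a simple property
(hypothetical names, sketch only):

    // Sketch only: for a returned offset o and a target timestamp t, every
    // message with offset < o is guaranteed to have a timestamp < t.
    def satisfiesSearchGuarantee(messages: Seq[(Long, Long)], // (offset, timestamp)
                                 returnedOffset: Long,
                                 target: Long): Boolean =
      messages.filter(_._1 < returnedOffset).forall(_._2 < target)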

Guozhang


On Wed, Apr 6, 2016 at 5:44 PM, Becket Qin <be...@gmail.com> wrote:

> Hi all,
>
> I updated KIP-33 based on the initial implementation. Per discussion on
> yesterday's KIP hangout, I would like to initiate the new vote thread for
> KIP-33.
>
> The KIP wiki:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
>
> Here is a brief summary of the KIP:
> 1. We propose to add a time index for each log segment.
> 2. The time indices are going to be used of log retention, log rolling and
> message search by timestamp.
>
> There was an old voting thread which has some discussions on this KIP. The
> mail thread link is following:
>
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
>
> I have the following WIP patch for reference. It needs a few more unit
> tests and documentation. Other than that it should run fine.
>
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
>
> Thanks,
>
> Jiangjie (Becket) Qin
>



-- 
-- Guozhang

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Liquan Pei <li...@gmail.com>.
+1

On Sat, Apr 16, 2016 at 10:25 PM, Gwen Shapira <gw...@confluent.io> wrote:

> +1
>
> On Fri, Apr 15, 2016 at 9:37 AM, Guozhang Wang <wa...@gmail.com> wrote:
> > +1 from me. Thanks.
> >
> > On Fri, Apr 15, 2016 at 9:16 AM, Jun Rao <ju...@confluent.io> wrote:
> >
> >> Hi, Jiangjie,
> >>
> >> Thanks for the latest update. +1 on the KIP.
> >>
> >> Jun
> >>
> >> On Thu, Apr 14, 2016 at 2:36 PM, Becket Qin <be...@gmail.com>
> wrote:
> >>
> >> > Hi Jun,
> >> >
> >> > 11. Yes, that sounds a reasonable solution. In the latest patch I am
> >> doing
> >> > the following in order:
> >> > a. Create an empty time index for a log segment if there isn't one.
> >> > b. For all non-active log segments, append an entry of
> >> > (last_modification_time -> next_base_offset) into the index. The time
> >> index
> >> > of the active segment is left empty. All the in-memory maxTimestamp is
> >> set
> >> > to -1.
> >> > c. If there is a hard failure, the time index and offset index will
> both
> >> be
> >> > rebuilt.
> >> >
> >> > So we do not rebuild the time index during upgrade unless there was a
> >> hard
> >> > failure. I have updated the wiki to reflect this.
> >> >
> >> > BTW, it seems that the current code will never hit the case where an
> >> index
> >> > is missing. I commented on PR.
> >> >
> >> > Thanks,
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >> >
> >> >
> >> > On Thu, Apr 14, 2016 at 10:00 AM, Jun Rao <ju...@confluent.io> wrote:
> >> >
> >> > > Hi, Jiangjie,
> >> > >
> >> > > 11. Rebuilding all missing time indexes will make the upgrade
> process
> >> > > longer since the log segments pre 0.10.0 don't have the time index.
> >> Could
> >> > > we only rebuild the missing indexes after the last flush offset? For
> >> > other
> >> > > segments missing the time index, we just assume lastTimestamp to be
> -1?
> >> > >
> >> > > Thanks,
> >> > >
> >> > > Jun
> >> > >
> >> > > On Wed, Apr 13, 2016 at 9:55 AM, Becket Qin <be...@gmail.com>
> >> > wrote:
> >> > >
> >> > > > Hi Jun and Guozhang,
> >> > > >
> >> > > > I have updated the KIP wiki to incorporate your comments. Please
> let
> >> me
> >> > > > know if you prefer starting another discussion thread for further
> >> > > > discussion.
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jiangjie (Becket) Qin
> >> > > >
> >> > > > On Mon, Apr 11, 2016 at 12:21 AM, Becket Qin <
> becket.qin@gmail.com>
> >> > > wrote:
> >> > > >
> >> > > > > Hi Guozhang and Jun,
> >> > > > >
> >> > > > > Thanks for the comments. Please see the responses below.
> >> > > > >
> >> > > > > Regarding to Guozhang's question #1 and Jun's question #12. I
> was
> >> > > > > inserting the time index and offset index entry together mostly
> for
> >> > > > > simplicity as Guozhang mentioned. The purpose of using index
> >> interval
> >> > > > bytes
> >> > > > > for time index was to control the density of the time index,
> which
> >> is
> >> > > the
> >> > > > > same purpose as offset index. It seems reasonable to make them
> >> > aligned.
> >> > > > We
> >> > > > > can track separately the physical position when we insert the
> last
> >> > time
> >> > > > > index entry(my original code did that), but when I wrote the
> code I
> >> > > feel
> >> > > > it
> >> > > > > seems unnecessary. Another minor benefit is that searching by
> >> > timestamp
> >> > > > > could be potentially faster if we align the time index and
> offset
> >> > > index.
> >> > > > > It is possible that we only have either a corrupted time index
> or
> >> an
> >> > > > > offset index, but not both. Although we can choose to only
> rebuild
> >> > the
> >> > > > one
> >> > > > > which is corrupted, given that we have to scan the entire log
> >> segment
> >> > > > > anyway, rebuilding both of them seems not much overhead. So the
> >> > current
> >> > > > > patch I have is rebuilding both of them together.
> >> > > > >
> >> > > > > 10. Yes, it should only happen after a hard failure. The last
> time
> >> > > index
> >> > > > > entry of a normally closed segment has already points to the
> LEO,
> >> so
> >> > > > there
> >> > > > > is no scan during start up.
> >> > > > >
> >> > > > > 11. On broker startup, if a time index does not exist, an empty
> one
> >> > > will
> >> > > > > be created first. If message format version is 0.9.0, we will
> >> append
> >> > a
> >> > > > time
> >> > > > > index entry of (last modification time -> base offset of next
> >> > segment)
> >> > > to
> >> > > > > the time index of each inactive segment. So no actual rebuild
> will
> >> > > happen
> >> > > > > during upgrade. However, if message format version is 0.10.0, we
> >> will
> >> > > > > rebuild the time index if it does not exist. (I actually had a
> >> > question
> >> > > > > about the how we are loading the log segments, we can discuss
> it in
> >> > the
> >> > > > PR)
> >> > > > >
> >> > > > > I will update the wiki to clarify the question raised in the
> >> comments
> >> > > and
> >> > > > > submit a PR by tomorrow. I am currently cleaning up the
> >> > documentation.
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > Jiangjie (Becket) Qin
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Sun, Apr 10, 2016 at 9:25 PM, Jun Rao <ju...@confluent.io>
> wrote:
> >> > > > >
> >> > > > >> Hi, Jiangjie,
> >> > > > >>
> >> > > > >> Thanks for the update. Looks good to me overall. Just a few
> minor
> >> > > > comments
> >> > > > >> below.
> >> > > > >>
> >> > > > >> 10. On broker startup, it's not clear to me why we need to scan
> >> the
> >> > > log
> >> > > > >> segment to retrieve the largest timestamp since the time index
> >> > always
> >> > > > has
> >> > > > >> an entry for the largest timestamp. Is that only for restarting
> >> > after
> >> > > a
> >> > > > >> hard failure?
> >> > > > >>
> >> > > > >> 11. On broker startup, if a log segment misses the time index,
> do
> >> we
> >> > > > >> always
> >> > > > >> rebuild it? This can happen when the broker is upgraded.
> >> > > > >>
> >> > > > >> 12. Related to Guozhang's question #1. It seems it's simpler to
> >> add
> >> > > time
> >> > > > >> index entries independent of the offset index since at index
> entry
> >> > may
> >> > > > not
> >> > > > >> be added to the offset and the time index at the same time.
> Also,
> >> > this
> >> > > > >> allows time index to be rebuilt independently if needed.
> >> > > > >>
> >> > > > >> Thanks,
> >> > > > >>
> >> > > > >> Jun
> >> > > > >>
> >> > > > >>
> >> > > > >> On Wed, Apr 6, 2016 at 5:44 PM, Becket Qin <
> becket.qin@gmail.com>
> >> > > > wrote:
> >> > > > >>
> >> > > > >> > Hi all,
> >> > > > >> >
> >> > > > >> > I updated KIP-33 based on the initial implementation. Per
> >> > discussion
> >> > > > on
> >> > > > >> > yesterday's KIP hangout, I would like to initiate the new
> vote
> >> > > thread
> >> > > > >> for
> >> > > > >> > KIP-33.
> >> > > > >> >
> >> > > > >> > The KIP wiki:
> >> > > > >> >
> >> > > > >> >
> >> > > > >>
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> >> > > > >> >
> >> > > > >> > Here is a brief summary of the KIP:
> >> > > > >> > 1. We propose to add a time index for each log segment.
> >> > > > >> > 2. The time indices are going to be used of log retention,
> log
> >> > > rolling
> >> > > > >> and
> >> > > > >> > message search by timestamp.
> >> > > > >> >
> >> > > > >> > There was an old voting thread which has some discussions on
> >> this
> >> > > KIP.
> >> > > > >> The
> >> > > > >> > mail thread link is following:
> >> > > > >> >
> >> > > > >> >
> >> > > > >>
> >> > > >
> >> > >
> >> >
> >>
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> >> > > > >> >
> >> > > > >> > I have the following WIP patch for reference. It needs a few
> >> more
> >> > > unit
> >> > > > >> > tests and documentation. Other than that it should run fine.
> >> > > > >> >
> >> > > > >> >
> >> > > > >>
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> >> > > > >> >
> >> > > > >> > Thanks,
> >> > > > >> >
> >> > > > >> > Jiangjie (Becket) Qin
> >> > > > >> >
> >> > > > >>
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
> >
> > --
> > -- Guozhang
>



-- 
Liquan Pei
Software Engineer, Confluent Inc

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Gwen Shapira <gw...@confluent.io>.
+1

On Fri, Apr 15, 2016 at 9:37 AM, Guozhang Wang <wa...@gmail.com> wrote:
> +1 from me. Thanks.
>
> On Fri, Apr 15, 2016 at 9:16 AM, Jun Rao <ju...@confluent.io> wrote:
>
>> Hi, Jiangjie,
>>
>> Thanks for the latest update. +1 on the KIP.
>>
>> Jun
>>
>> On Thu, Apr 14, 2016 at 2:36 PM, Becket Qin <be...@gmail.com> wrote:
>>
>> > Hi Jun,
>> >
>> > 11. Yes, that sounds a reasonable solution. In the latest patch I am
>> doing
>> > the following in order:
>> > a. Create an empty time index for a log segment if there isn't one.
>> > b. For all non-active log segments, append an entry of
>> > (last_modification_time -> next_base_offset) into the index. The time
>> index
>> > of the active segment is left empty. All the in-memory maxTimestamp is
>> set
>> > to -1.
>> > c. If there is a hard failure, the time index and offset index will both
>> be
>> > rebuilt.
>> >
>> > So we do not rebuild the time index during upgrade unless there was a
>> hard
>> > failure. I have updated the wiki to reflect this.
>> >
>> > BTW, it seems that the current code will never hit the case where an
>> index
>> > is missing. I commented on PR.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> >
>> >
>> > On Thu, Apr 14, 2016 at 10:00 AM, Jun Rao <ju...@confluent.io> wrote:
>> >
>> > > Hi, Jiangjie,
>> > >
>> > > 11. Rebuilding all missing time indexes will make the upgrade process
>> > > longer since the log segments pre 0.10.0 don't have the time index.
>> Could
>> > > we only rebuild the missing indexes after the last flush offset? For
>> > other
>> > > segments missing the time index, we just assume lastTimestamp to be -1?
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > > On Wed, Apr 13, 2016 at 9:55 AM, Becket Qin <be...@gmail.com>
>> > wrote:
>> > >
>> > > > Hi Jun and Guozhang,
>> > > >
>> > > > I have updated the KIP wiki to incorporate your comments. Please let
>> me
>> > > > know if you prefer starting another discussion thread for further
>> > > > discussion.
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jiangjie (Becket) Qin
>> > > >
>> > > > On Mon, Apr 11, 2016 at 12:21 AM, Becket Qin <be...@gmail.com>
>> > > wrote:
>> > > >
>> > > > > Hi Guozhang and Jun,
>> > > > >
>> > > > > Thanks for the comments. Please see the responses below.
>> > > > >
>> > > > > Regarding to Guozhang's question #1 and Jun's question #12. I was
>> > > > > inserting the time index and offset index entry together mostly for
>> > > > > simplicity as Guozhang mentioned. The purpose of using index
>> interval
>> > > > bytes
>> > > > > for time index was to control the density of the time index, which
>> is
>> > > the
>> > > > > same purpose as offset index. It seems reasonable to make them
>> > aligned.
>> > > > We
>> > > > > can track separately the physical position when we insert the last
>> > time
>> > > > > index entry(my original code did that), but when I wrote the code I
>> > > feel
>> > > > it
>> > > > > seems unnecessary. Another minor benefit is that searching by
>> > timestamp
>> > > > > could be potentially faster if we align the time index and offset
>> > > index.
>> > > > > It is possible that we only have either a corrupted time index or
>> an
>> > > > > offset index, but not both. Although we can choose to only rebuild
>> > the
>> > > > one
>> > > > > which is corrupted, given that we have to scan the entire log
>> segment
>> > > > > anyway, rebuilding both of them seems not much overhead. So the
>> > current
>> > > > > patch I have is rebuilding both of them together.
>> > > > >
>> > > > > 10. Yes, it should only happen after a hard failure. The last time
>> > > index
>> > > > > entry of a normally closed segment has already points to the LEO,
>> so
>> > > > there
>> > > > > is no scan during start up.
>> > > > >
>> > > > > 11. On broker startup, if a time index does not exist, an empty one
>> > > will
>> > > > > be created first. If message format version is 0.9.0, we will
>> append
>> > a
>> > > > time
>> > > > > index entry of (last modification time -> base offset of next
>> > segment)
>> > > to
>> > > > > the time index of each inactive segment. So no actual rebuild will
>> > > happen
>> > > > > during upgrade. However, if message format version is 0.10.0, we
>> will
>> > > > > rebuild the time index if it does not exist. (I actually had a
>> > question
>> > > > > about the how we are loading the log segments, we can discuss it in
>> > the
>> > > > PR)
>> > > > >
>> > > > > I will update the wiki to clarify the question raised in the
>> comments
>> > > and
>> > > > > submit a PR by tomorrow. I am currently cleaning up the
>> > documentation.
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jiangjie (Becket) Qin
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Sun, Apr 10, 2016 at 9:25 PM, Jun Rao <ju...@confluent.io> wrote:
>> > > > >
>> > > > >> Hi, Jiangjie,
>> > > > >>
>> > > > >> Thanks for the update. Looks good to me overall. Just a few minor
>> > > > comments
>> > > > >> below.
>> > > > >>
>> > > > >> 10. On broker startup, it's not clear to me why we need to scan
>> the
>> > > log
>> > > > >> segment to retrieve the largest timestamp since the time index
>> > always
>> > > > has
>> > > > >> an entry for the largest timestamp. Is that only for restarting
>> > after
>> > > a
>> > > > >> hard failure?
>> > > > >>
>> > > > >> 11. On broker startup, if a log segment misses the time index, do
>> we
>> > > > >> always
>> > > > >> rebuild it? This can happen when the broker is upgraded.
>> > > > >>
>> > > > >> 12. Related to Guozhang's question #1. It seems it's simpler to
>> add
>> > > time
>> > > > >> index entries independent of the offset index since at index entry
>> > may
>> > > > not
>> > > > >> be added to the offset and the time index at the same time. Also,
>> > this
>> > > > >> allows time index to be rebuilt independently if needed.
>> > > > >>
>> > > > >> Thanks,
>> > > > >>
>> > > > >> Jun
>> > > > >>
>> > > > >>
>> > > > >> On Wed, Apr 6, 2016 at 5:44 PM, Becket Qin <be...@gmail.com>
>> > > > wrote:
>> > > > >>
>> > > > >> > Hi all,
>> > > > >> >
>> > > > >> > I updated KIP-33 based on the initial implementation. Per
>> > discussion
>> > > > on
>> > > > >> > yesterday's KIP hangout, I would like to initiate the new vote
>> > > thread
>> > > > >> for
>> > > > >> > KIP-33.
>> > > > >> >
>> > > > >> > The KIP wiki:
>> > > > >> >
>> > > > >> >
>> > > > >>
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
>> > > > >> >
>> > > > >> > Here is a brief summary of the KIP:
>> > > > >> > 1. We propose to add a time index for each log segment.
>> > > > >> > 2. The time indices are going to be used of log retention, log
>> > > rolling
>> > > > >> and
>> > > > >> > message search by timestamp.
>> > > > >> >
>> > > > >> > There was an old voting thread which has some discussions on
>> this
>> > > KIP.
>> > > > >> The
>> > > > >> > mail thread link is following:
>> > > > >> >
>> > > > >> >
>> > > > >>
>> > > >
>> > >
>> >
>> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
>> > > > >> >
>> > > > >> > I have the following WIP patch for reference. It needs a few
>> more
>> > > unit
>> > > > >> > tests and documentation. Other than that it should run fine.
>> > > > >> >
>> > > > >> >
>> > > > >>
>> > > >
>> > >
>> >
>> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
>> > > > >> >
>> > > > >> > Thanks,
>> > > > >> >
>> > > > >> > Jiangjie (Becket) Qin
>> > > > >> >
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>
>
> --
> -- Guozhang

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Guozhang Wang <wa...@gmail.com>.
+1 from me. Thanks.

On Fri, Apr 15, 2016 at 9:16 AM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Jiangjie,
>
> Thanks for the latest update. +1 on the KIP.
>
> Jun
>
> On Thu, Apr 14, 2016 at 2:36 PM, Becket Qin <be...@gmail.com> wrote:
>
> > Hi Jun,
> >
> > 11. Yes, that sounds a reasonable solution. In the latest patch I am
> doing
> > the following in order:
> > a. Create an empty time index for a log segment if there isn't one.
> > b. For all non-active log segments, append an entry of
> > (last_modification_time -> next_base_offset) into the index. The time
> index
> > of the active segment is left empty. All the in-memory maxTimestamp is
> set
> > to -1.
> > c. If there is a hard failure, the time index and offset index will both
> be
> > rebuilt.
> >
> > So we do not rebuild the time index during upgrade unless there was a
> hard
> > failure. I have updated the wiki to reflect this.
> >
> > BTW, it seems that the current code will never hit the case where an
> index
> > is missing. I commented on PR.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Thu, Apr 14, 2016 at 10:00 AM, Jun Rao <ju...@confluent.io> wrote:
> >
> > > Hi, Jiangjie,
> > >
> > > 11. Rebuilding all missing time indexes will make the upgrade process
> > > longer since the log segments pre 0.10.0 don't have the time index.
> Could
> > > we only rebuild the missing indexes after the last flush offset? For
> > other
> > > segments missing the time index, we just assume lastTimestamp to be -1?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Wed, Apr 13, 2016 at 9:55 AM, Becket Qin <be...@gmail.com>
> > wrote:
> > >
> > > > Hi Jun and Guozhang,
> > > >
> > > > I have updated the KIP wiki to incorporate your comments. Please let
> me
> > > > know if you prefer starting another discussion thread for further
> > > > discussion.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Mon, Apr 11, 2016 at 12:21 AM, Becket Qin <be...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Guozhang and Jun,
> > > > >
> > > > > Thanks for the comments. Please see the responses below.
> > > > >
> > > > > Regarding to Guozhang's question #1 and Jun's question #12. I was
> > > > > inserting the time index and offset index entry together mostly for
> > > > > simplicity as Guozhang mentioned. The purpose of using index
> interval
> > > > bytes
> > > > > for time index was to control the density of the time index, which
> is
> > > the
> > > > > same purpose as offset index. It seems reasonable to make them
> > aligned.
> > > > We
> > > > > can track separately the physical position when we insert the last
> > time
> > > > > index entry(my original code did that), but when I wrote the code I
> > > feel
> > > > it
> > > > > seems unnecessary. Another minor benefit is that searching by
> > timestamp
> > > > > could be potentially faster if we align the time index and offset
> > > index.
> > > > > It is possible that we only have either a corrupted time index or
> an
> > > > > offset index, but not both. Although we can choose to only rebuild
> > the
> > > > one
> > > > > which is corrupted, given that we have to scan the entire log
> segment
> > > > > anyway, rebuilding both of them seems not much overhead. So the
> > current
> > > > > patch I have is rebuilding both of them together.
> > > > >
> > > > > 10. Yes, it should only happen after a hard failure. The last time
> > > index
> > > > > entry of a normally closed segment has already points to the LEO,
> so
> > > > there
> > > > > is no scan during start up.
> > > > >
> > > > > 11. On broker startup, if a time index does not exist, an empty one
> > > will
> > > > > be created first. If message format version is 0.9.0, we will
> append
> > a
> > > > time
> > > > > index entry of (last modification time -> base offset of next
> > segment)
> > > to
> > > > > the time index of each inactive segment. So no actual rebuild will
> > > happen
> > > > > during upgrade. However, if message format version is 0.10.0, we
> will
> > > > > rebuild the time index if it does not exist. (I actually had a
> > question
> > > > > about the how we are loading the log segments, we can discuss it in
> > the
> > > > PR)
> > > > >
> > > > > I will update the wiki to clarify the question raised in the
> comments
> > > and
> > > > > submit a PR by tomorrow. I am currently cleaning up the
> > documentation.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Apr 10, 2016 at 9:25 PM, Jun Rao <ju...@confluent.io> wrote:
> > > > >
> > > > >> Hi, Jiangjie,
> > > > >>
> > > > >> Thanks for the update. Looks good to me overall. Just a few minor
> > > > comments
> > > > >> below.
> > > > >>
> > > > >> 10. On broker startup, it's not clear to me why we need to scan
> the
> > > log
> > > > >> segment to retrieve the largest timestamp since the time index
> > always
> > > > has
> > > > >> an entry for the largest timestamp. Is that only for restarting
> > after
> > > a
> > > > >> hard failure?
> > > > >>
> > > > >> 11. On broker startup, if a log segment misses the time index, do
> we
> > > > >> always
> > > > >> rebuild it? This can happen when the broker is upgraded.
> > > > >>
> > > > >> 12. Related to Guozhang's question #1. It seems it's simpler to
> add
> > > time
> > > > >> index entries independent of the offset index since at index entry
> > may
> > > > not
> > > > >> be added to the offset and the time index at the same time. Also,
> > this
> > > > >> allows time index to be rebuilt independently if needed.
> > > > >>
> > > > >> Thanks,
> > > > >>
> > > > >> Jun
> > > > >>
> > > > >>
> > > > >> On Wed, Apr 6, 2016 at 5:44 PM, Becket Qin <be...@gmail.com>
> > > > wrote:
> > > > >>
> > > > >> > Hi all,
> > > > >> >
> > > > >> > I updated KIP-33 based on the initial implementation. Per
> > discussion
> > > > on
> > > > >> > yesterday's KIP hangout, I would like to initiate the new vote
> > > thread
> > > > >> for
> > > > >> > KIP-33.
> > > > >> >
> > > > >> > The KIP wiki:
> > > > >> >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> > > > >> >
> > > > >> > Here is a brief summary of the KIP:
> > > > >> > 1. We propose to add a time index for each log segment.
> > > > >> > 2. The time indices are going to be used of log retention, log
> > > rolling
> > > > >> and
> > > > >> > message search by timestamp.
> > > > >> >
> > > > >> > There was an old voting thread which has some discussions on
> this
> > > KIP.
> > > > >> The
> > > > >> > mail thread link is following:
> > > > >> >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> > > > >> >
> > > > >> > I have the following WIP patch for reference. It needs a few
> more
> > > unit
> > > > >> > tests and documentation. Other than that it should run fine.
> > > > >> >
> > > > >> >
> > > > >>
> > > >
> > >
> >
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> > Jiangjie (Becket) Qin
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>



-- 
-- Guozhang

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Jun Rao <ju...@confluent.io>.
Hi, Jiangjie,

Thanks for the latest update. +1 on the KIP.

Jun

On Thu, Apr 14, 2016 at 2:36 PM, Becket Qin <be...@gmail.com> wrote:

> Hi Jun,
>
> 11. Yes, that sounds a reasonable solution. In the latest patch I am doing
> the following in order:
> a. Create an empty time index for a log segment if there isn't one.
> b. For all non-active log segments, append an entry of
> (last_modification_time -> next_base_offset) into the index. The time index
> of the active segment is left empty. All the in-memory maxTimestamp is set
> to -1.
> c. If there is a hard failure, the time index and offset index will both be
> rebuilt.
>
> So we do not rebuild the time index during upgrade unless there was a hard
> failure. I have updated the wiki to reflect this.
>
> BTW, it seems that the current code will never hit the case where an index
> is missing. I commented on PR.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Thu, Apr 14, 2016 at 10:00 AM, Jun Rao <ju...@confluent.io> wrote:
>
> > Hi, Jiangjie,
> >
> > 11. Rebuilding all missing time indexes will make the upgrade process
> > longer since the log segments pre 0.10.0 don't have the time index. Could
> > we only rebuild the missing indexes after the last flush offset? For
> other
> > segments missing the time index, we just assume lastTimestamp to be -1?
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Apr 13, 2016 at 9:55 AM, Becket Qin <be...@gmail.com>
> wrote:
> >
> > > Hi Jun and Guozhang,
> > >
> > > I have updated the KIP wiki to incorporate your comments. Please let me
> > > know if you prefer starting another discussion thread for further
> > > discussion.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Mon, Apr 11, 2016 at 12:21 AM, Becket Qin <be...@gmail.com>
> > wrote:
> > >
> > > > Hi Guozhang and Jun,
> > > >
> > > > Thanks for the comments. Please see the responses below.
> > > >
> > > > Regarding to Guozhang's question #1 and Jun's question #12. I was
> > > > inserting the time index and offset index entry together mostly for
> > > > simplicity as Guozhang mentioned. The purpose of using index interval
> > > bytes
> > > > for time index was to control the density of the time index, which is
> > the
> > > > same purpose as offset index. It seems reasonable to make them
> aligned.
> > > We
> > > > can track separately the physical position when we insert the last
> time
> > > > index entry(my original code did that), but when I wrote the code I
> > feel
> > > it
> > > > seems unnecessary. Another minor benefit is that searching by
> timestamp
> > > > could be potentially faster if we align the time index and offset
> > index.
> > > > It is possible that we only have either a corrupted time index or an
> > > > offset index, but not both. Although we can choose to only rebuild
> the
> > > one
> > > > which is corrupted, given that we have to scan the entire log segment
> > > > anyway, rebuilding both of them seems not much overhead. So the
> current
> > > > patch I have is rebuilding both of them together.
> > > >
> > > > 10. Yes, it should only happen after a hard failure. The last time
> > index
> > > > entry of a normally closed segment has already points to the LEO, so
> > > there
> > > > is no scan during start up.
> > > >
> > > > 11. On broker startup, if a time index does not exist, an empty one
> > will
> > > > be created first. If message format version is 0.9.0, we will append
> a
> > > time
> > > > index entry of (last modification time -> base offset of next
> segment)
> > to
> > > > the time index of each inactive segment. So no actual rebuild will
> > happen
> > > > during upgrade. However, if message format version is 0.10.0, we will
> > > > rebuild the time index if it does not exist. (I actually had a
> question
> > > > about the how we are loading the log segments, we can discuss it in
> the
> > > PR)
> > > >
> > > > I will update the wiki to clarify the question raised in the comments
> > and
> > > > submit a PR by tomorrow. I am currently cleaning up the
> documentation.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > >
> > > >
> > > > On Sun, Apr 10, 2016 at 9:25 PM, Jun Rao <ju...@confluent.io> wrote:
> > > >
> > > >> Hi, Jiangjie,
> > > >>
> > > >> Thanks for the update. Looks good to me overall. Just a few minor
> > > comments
> > > >> below.
> > > >>
> > > >> 10. On broker startup, it's not clear to me why we need to scan the
> > log
> > > >> segment to retrieve the largest timestamp since the time index
> always
> > > has
> > > >> an entry for the largest timestamp. Is that only for restarting
> after
> > a
> > > >> hard failure?
> > > >>
> > > >> 11. On broker startup, if a log segment misses the time index, do we
> > > >> always
> > > >> rebuild it? This can happen when the broker is upgraded.
> > > >>
> > > >> 12. Related to Guozhang's question #1. It seems it's simpler to add
> > time
> > > >> index entries independent of the offset index since at index entry
> may
> > > not
> > > >> be added to the offset and the time index at the same time. Also,
> this
> > > >> allows time index to be rebuilt independently if needed.
> > > >>
> > > >> Thanks,
> > > >>
> > > >> Jun
> > > >>
> > > >>
> > > >> On Wed, Apr 6, 2016 at 5:44 PM, Becket Qin <be...@gmail.com>
> > > wrote:
> > > >>
> > > >> > Hi all,
> > > >> >
> > > >> > I updated KIP-33 based on the initial implementation. Per
> discussion
> > > on
> > > >> > yesterday's KIP hangout, I would like to initiate the new vote
> > thread
> > > >> for
> > > >> > KIP-33.
> > > >> >
> > > >> > The KIP wiki:
> > > >> >
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> > > >> >
> > > >> > Here is a brief summary of the KIP:
> > > >> > 1. We propose to add a time index for each log segment.
> > > >> > 2. The time indices are going to be used of log retention, log
> > rolling
> > > >> and
> > > >> > message search by timestamp.
> > > >> >
> > > >> > There was an old voting thread which has some discussions on this
> > KIP.
> > > >> The
> > > >> > mail thread link is following:
> > > >> >
> > > >> >
> > > >>
> > >
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> > > >> >
> > > >> > I have the following WIP patch for reference. It needs a few more
> > unit
> > > >> > tests and documentation. Other than that it should run fine.
> > > >> >
> > > >> >
> > > >>
> > >
> >
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Jiangjie (Becket) Qin
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Becket Qin <be...@gmail.com>.
Hi Jun,

11. Yes, that sounds like a reasonable solution. In the latest patch I am doing
the following in order:
a. Create an empty time index for a log segment if there isn't one.
b. For all non-active log segments, append an entry of
(last_modification_time -> next_base_offset) into the index. The time index
of the active segment is left empty. All the in-memory maxTimestamps are set
to -1.
c. If there is a hard failure, the time index and offset index will both be
rebuilt.

So we do not rebuild the time index during upgrade unless there was a hard
failure. I have updated the wiki to reflect this.
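
To make the ordering concrete, here is a rough Scala sketch of that startup
pass. The names (SegmentInfo, prepareTimeIndexes) are invented for
illustration and heavily simplified; they are not the classes in the patch.

import java.io.File

object TimeIndexUpgradeSketch {

  case class SegmentInfo(baseOffset: Long,
                         nextBaseOffset: Long,
                         lastModifiedMs: Long,
                         isActive: Boolean,
                         timeIndexFile: File)

  // One (timestamp -> offset) time index entry.
  case class TimeIndexEntry(timestampMs: Long, offset: Long)

  // Returns the entry (if any) appended to each segment's time index on startup.
  def prepareTimeIndexes(segments: Seq[SegmentInfo],
                         hadHardFailure: Boolean): Map[Long, Option[TimeIndexEntry]] =
    segments.map { seg =>
      // a. Create an empty time index file if the segment does not have one.
      if (!seg.timeIndexFile.exists())
        seg.timeIndexFile.createNewFile()

      val entry =
        if (hadHardFailure)
          None // c. Hard failure: both indexes are rebuilt by scanning the log (not shown here).
        else if (!seg.isActive)
          // b. Inactive segment: append (last_modification_time -> next_base_offset).
          Some(TimeIndexEntry(seg.lastModifiedMs, seg.nextBaseOffset))
        else
          None // Active segment: time index stays empty, in-memory maxTimestamp = -1.
      seg.baseOffset -> entry
    }.toMap
}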

BTW, it seems that the current code will never hit the case where an index
is missing. I commented on the PR.

Thanks,

Jiangjie (Becket) Qin



On Thu, Apr 14, 2016 at 10:00 AM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Jiangjie,
>
> 11. Rebuilding all missing time indexes will make the upgrade process
> longer since the log segments pre 0.10.0 don't have the time index. Could
> we only rebuild the missing indexes after the last flush offset? For other
> segments missing the time index, we just assume lastTimestamp to be -1?
>
> Thanks,
>
> Jun
>
> On Wed, Apr 13, 2016 at 9:55 AM, Becket Qin <be...@gmail.com> wrote:
>
> > Hi Jun and Guozhang,
> >
> > I have updated the KIP wiki to incorporate your comments. Please let me
> > know if you prefer starting another discussion thread for further
> > discussion.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Mon, Apr 11, 2016 at 12:21 AM, Becket Qin <be...@gmail.com>
> wrote:
> >
> > > Hi Guozhang and Jun,
> > >
> > > Thanks for the comments. Please see the responses below.
> > >
> > > Regarding to Guozhang's question #1 and Jun's question #12. I was
> > > inserting the time index and offset index entry together mostly for
> > > simplicity as Guozhang mentioned. The purpose of using index interval
> > bytes
> > > for time index was to control the density of the time index, which is
> the
> > > same purpose as offset index. It seems reasonable to make them aligned.
> > We
> > > can track separately the physical position when we insert the last time
> > > index entry(my original code did that), but when I wrote the code I
> feel
> > it
> > > seems unnecessary. Another minor benefit is that searching by timestamp
> > > could be potentially faster if we align the time index and offset
> index.
> > > It is possible that we only have either a corrupted time index or an
> > > offset index, but not both. Although we can choose to only rebuild the
> > one
> > > which is corrupted, given that we have to scan the entire log segment
> > > anyway, rebuilding both of them seems not much overhead. So the current
> > > patch I have is rebuilding both of them together.
> > >
> > > 10. Yes, it should only happen after a hard failure. The last time
> index
> > > entry of a normally closed segment has already points to the LEO, so
> > there
> > > is no scan during start up.
> > >
> > > 11. On broker startup, if a time index does not exist, an empty one
> will
> > > be created first. If message format version is 0.9.0, we will append a
> > time
> > > index entry of (last modification time -> base offset of next segment)
> to
> > > the time index of each inactive segment. So no actual rebuild will
> happen
> > > during upgrade. However, if message format version is 0.10.0, we will
> > > rebuild the time index if it does not exist. (I actually had a question
> > > about the how we are loading the log segments, we can discuss it in the
> > PR)
> > >
> > > I will update the wiki to clarify the question raised in the comments
> and
> > > submit a PR by tomorrow. I am currently cleaning up the documentation.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > >
> > >
> > > On Sun, Apr 10, 2016 at 9:25 PM, Jun Rao <ju...@confluent.io> wrote:
> > >
> > >> Hi, Jiangjie,
> > >>
> > >> Thanks for the update. Looks good to me overall. Just a few minor
> > comments
> > >> below.
> > >>
> > >> 10. On broker startup, it's not clear to me why we need to scan the
> log
> > >> segment to retrieve the largest timestamp since the time index always
> > has
> > >> an entry for the largest timestamp. Is that only for restarting after
> a
> > >> hard failure?
> > >>
> > >> 11. On broker startup, if a log segment misses the time index, do we
> > >> always
> > >> rebuild it? This can happen when the broker is upgraded.
> > >>
> > >> 12. Related to Guozhang's question #1. It seems it's simpler to add
> time
> > >> index entries independent of the offset index since at index entry may
> > not
> > >> be added to the offset and the time index at the same time. Also, this
> > >> allows time index to be rebuilt independently if needed.
> > >>
> > >> Thanks,
> > >>
> > >> Jun
> > >>
> > >>
> > >> On Wed, Apr 6, 2016 at 5:44 PM, Becket Qin <be...@gmail.com>
> > wrote:
> > >>
> > >> > Hi all,
> > >> >
> > >> > I updated KIP-33 based on the initial implementation. Per discussion
> > on
> > >> > yesterday's KIP hangout, I would like to initiate the new vote
> thread
> > >> for
> > >> > KIP-33.
> > >> >
> > >> > The KIP wiki:
> > >> >
> > >> >
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> > >> >
> > >> > Here is a brief summary of the KIP:
> > >> > 1. We propose to add a time index for each log segment.
> > >> > 2. The time indices are going to be used of log retention, log
> rolling
> > >> and
> > >> > message search by timestamp.
> > >> >
> > >> > There was an old voting thread which has some discussions on this
> KIP.
> > >> The
> > >> > mail thread link is following:
> > >> >
> > >> >
> > >>
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> > >> >
> > >> > I have the following WIP patch for reference. It needs a few more
> unit
> > >> > tests and documentation. Other than that it should run fine.
> > >> >
> > >> >
> > >>
> >
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> > >> >
> > >> > Thanks,
> > >> >
> > >> > Jiangjie (Becket) Qin
> > >> >
> > >>
> > >
> > >
> >
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Jun Rao <ju...@confluent.io>.
Hi, Jiangjie,

11. Rebuilding all missing time indexes will make the upgrade process
longer, since the pre-0.10.0 log segments don't have a time index. Could
we only rebuild the missing indexes for segments after the last flush
offset? For the other segments missing the time index, could we just
assume lastTimestamp to be -1?
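
Roughly, something like this sketch (SegmentMeta and lastFlushOffset are just
placeholder names for illustration, not actual broker code):

object SelectiveRebuildSketch {
  case class SegmentMeta(baseOffset: Long, hasTimeIndex: Boolean, lastTimestamp: Long)

  // Only segments at or above the last flush offset get a missing time index rebuilt.
  def needsRebuild(seg: SegmentMeta, lastFlushOffset: Long): Boolean =
    !seg.hasTimeIndex && seg.baseOffset >= lastFlushOffset

  // Older segments without a time index simply report lastTimestamp = -1.
  def effectiveLastTimestamp(seg: SegmentMeta): Long =
    if (seg.hasTimeIndex) seg.lastTimestamp else -1L
}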

Thanks,

Jun

On Wed, Apr 13, 2016 at 9:55 AM, Becket Qin <be...@gmail.com> wrote:

> Hi Jun and Guozhang,
>
> I have updated the KIP wiki to incorporate your comments. Please let me
> know if you prefer starting another discussion thread for further
> discussion.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, Apr 11, 2016 at 12:21 AM, Becket Qin <be...@gmail.com> wrote:
>
> > Hi Guozhang and Jun,
> >
> > Thanks for the comments. Please see the responses below.
> >
> > Regarding to Guozhang's question #1 and Jun's question #12. I was
> > inserting the time index and offset index entry together mostly for
> > simplicity as Guozhang mentioned. The purpose of using index interval
> bytes
> > for time index was to control the density of the time index, which is the
> > same purpose as offset index. It seems reasonable to make them aligned.
> We
> > can track separately the physical position when we insert the last time
> > index entry(my original code did that), but when I wrote the code I feel
> it
> > seems unnecessary. Another minor benefit is that searching by timestamp
> > could be potentially faster if we align the time index and offset index.
> > It is possible that we only have either a corrupted time index or an
> > offset index, but not both. Although we can choose to only rebuild the
> one
> > which is corrupted, given that we have to scan the entire log segment
> > anyway, rebuilding both of them seems not much overhead. So the current
> > patch I have is rebuilding both of them together.
> >
> > 10. Yes, it should only happen after a hard failure. The last time index
> > entry of a normally closed segment has already points to the LEO, so
> there
> > is no scan during start up.
> >
> > 11. On broker startup, if a time index does not exist, an empty one will
> > be created first. If message format version is 0.9.0, we will append a
> time
> > index entry of (last modification time -> base offset of next segment) to
> > the time index of each inactive segment. So no actual rebuild will happen
> > during upgrade. However, if message format version is 0.10.0, we will
> > rebuild the time index if it does not exist. (I actually had a question
> > about the how we are loading the log segments, we can discuss it in the
> PR)
> >
> > I will update the wiki to clarify the question raised in the comments and
> > submit a PR by tomorrow. I am currently cleaning up the documentation.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> >
> >
> > On Sun, Apr 10, 2016 at 9:25 PM, Jun Rao <ju...@confluent.io> wrote:
> >
> >> Hi, Jiangjie,
> >>
> >> Thanks for the update. Looks good to me overall. Just a few minor
> comments
> >> below.
> >>
> >> 10. On broker startup, it's not clear to me why we need to scan the log
> >> segment to retrieve the largest timestamp since the time index always
> has
> >> an entry for the largest timestamp. Is that only for restarting after a
> >> hard failure?
> >>
> >> 11. On broker startup, if a log segment misses the time index, do we
> >> always
> >> rebuild it? This can happen when the broker is upgraded.
> >>
> >> 12. Related to Guozhang's question #1. It seems it's simpler to add time
> >> index entries independent of the offset index since at index entry may
> not
> >> be added to the offset and the time index at the same time. Also, this
> >> allows time index to be rebuilt independently if needed.
> >>
> >> Thanks,
> >>
> >> Jun
> >>
> >>
> >> On Wed, Apr 6, 2016 at 5:44 PM, Becket Qin <be...@gmail.com>
> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I updated KIP-33 based on the initial implementation. Per discussion
> on
> >> > yesterday's KIP hangout, I would like to initiate the new vote thread
> >> for
> >> > KIP-33.
> >> >
> >> > The KIP wiki:
> >> >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> >> >
> >> > Here is a brief summary of the KIP:
> >> > 1. We propose to add a time index for each log segment.
> >> > 2. The time indices are going to be used of log retention, log rolling
> >> and
> >> > message search by timestamp.
> >> >
> >> > There was an old voting thread which has some discussions on this KIP.
> >> The
> >> > mail thread link is following:
> >> >
> >> >
> >>
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> >> >
> >> > I have the following WIP patch for reference. It needs a few more unit
> >> > tests and documentation. Other than that it should run fine.
> >> >
> >> >
> >>
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> >> >
> >> > Thanks,
> >> >
> >> > Jiangjie (Becket) Qin
> >> >
> >>
> >
> >
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Becket Qin <be...@gmail.com>.
Hi Jun and Guozhang,

I have updated the KIP wiki to incorporate your comments. Please let me
know if you prefer starting another discussion thread for further
discussion.

Thanks,

Jiangjie (Becket) Qin

On Mon, Apr 11, 2016 at 12:21 AM, Becket Qin <be...@gmail.com> wrote:

> Hi Guozhang and Jun,
>
> Thanks for the comments. Please see the responses below.
>
> Regarding to Guozhang's question #1 and Jun's question #12. I was
> inserting the time index and offset index entry together mostly for
> simplicity as Guozhang mentioned. The purpose of using index interval bytes
> for time index was to control the density of the time index, which is the
> same purpose as offset index. It seems reasonable to make them aligned. We
> can track separately the physical position when we insert the last time
> index entry(my original code did that), but when I wrote the code I feel it
> seems unnecessary. Another minor benefit is that searching by timestamp
> could be potentially faster if we align the time index and offset index.
> It is possible that we only have either a corrupted time index or an
> offset index, but not both. Although we can choose to only rebuild the one
> which is corrupted, given that we have to scan the entire log segment
> anyway, rebuilding both of them seems not much overhead. So the current
> patch I have is rebuilding both of them together.
>
> 10. Yes, it should only happen after a hard failure. The last time index
> entry of a normally closed segment has already points to the LEO, so there
> is no scan during start up.
>
> 11. On broker startup, if a time index does not exist, an empty one will
> be created first. If message format version is 0.9.0, we will append a time
> index entry of (last modification time -> base offset of next segment) to
> the time index of each inactive segment. So no actual rebuild will happen
> during upgrade. However, if message format version is 0.10.0, we will
> rebuild the time index if it does not exist. (I actually had a question
> about the how we are loading the log segments, we can discuss it in the PR)
>
> I will update the wiki to clarify the question raised in the comments and
> submit a PR by tomorrow. I am currently cleaning up the documentation.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
>
> On Sun, Apr 10, 2016 at 9:25 PM, Jun Rao <ju...@confluent.io> wrote:
>
>> Hi, Jiangjie,
>>
>> Thanks for the update. Looks good to me overall. Just a few minor comments
>> below.
>>
>> 10. On broker startup, it's not clear to me why we need to scan the log
>> segment to retrieve the largest timestamp since the time index always has
>> an entry for the largest timestamp. Is that only for restarting after a
>> hard failure?
>>
>> 11. On broker startup, if a log segment misses the time index, do we
>> always
>> rebuild it? This can happen when the broker is upgraded.
>>
>> 12. Related to Guozhang's question #1. It seems it's simpler to add time
>> index entries independent of the offset index since at index entry may not
>> be added to the offset and the time index at the same time. Also, this
>> allows time index to be rebuilt independently if needed.
>>
>> Thanks,
>>
>> Jun
>>
>>
>> On Wed, Apr 6, 2016 at 5:44 PM, Becket Qin <be...@gmail.com> wrote:
>>
>> > Hi all,
>> >
>> > I updated KIP-33 based on the initial implementation. Per discussion on
>> > yesterday's KIP hangout, I would like to initiate the new vote thread
>> for
>> > KIP-33.
>> >
>> > The KIP wiki:
>> >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
>> >
>> > Here is a brief summary of the KIP:
>> > 1. We propose to add a time index for each log segment.
>> > 2. The time indices are going to be used of log retention, log rolling
>> and
>> > message search by timestamp.
>> >
>> > There was an old voting thread which has some discussions on this KIP.
>> The
>> > mail thread link is following:
>> >
>> >
>> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
>> >
>> > I have the following WIP patch for reference. It needs a few more unit
>> > tests and documentation. Other than that it should run fine.
>> >
>> >
>> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>>
>
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Becket Qin <be...@gmail.com>.
Hi Guozhang and Jun,

Thanks for the comments. Please see the responses below.

Regarding Guozhang's question #1 and Jun's question #12: I was inserting
the time index and offset index entries together mostly for simplicity, as
Guozhang mentioned. The purpose of using index.interval.bytes for the time
index was to control the density of the time index, which is the same
purpose it serves for the offset index, so it seems reasonable to keep them
aligned. We could track the physical position separately when we insert the
last time index entry (my original code did that), but when I wrote the code
it seemed unnecessary. Another minor benefit is that searching by timestamp
could be potentially faster if we align the time index and offset index.
It is possible that only the time index or only the offset index is
corrupted, not both. Although we could choose to rebuild only the corrupted
one, given that we have to scan the entire log segment anyway, rebuilding
both of them does not add much overhead. So the current patch rebuilds both
of them together.
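
As a toy illustration of that alignment (assumed names like maxTimestampSoFar
and bytesSinceLastIndexEntry; real indexes are memory-mapped files, not
in-memory buffers), both entries can be gated by the same byte accumulator:

import scala.collection.mutable.ArrayBuffer

class AlignedIndexSketch(indexIntervalBytes: Int) {
  private var bytesSinceLastIndexEntry = 0
  private var maxTimestampSoFar = -1L

  val offsetIndex = ArrayBuffer.empty[(Long, Int)]   // (offset, physical position)
  val timeIndex   = ArrayBuffer.empty[(Long, Long)]  // (max timestamp so far, offset)

  def onAppend(offset: Long, timestamp: Long, physicalPosition: Int, messageSizeInBytes: Int): Unit = {
    maxTimestampSoFar = math.max(maxTimestampSoFar, timestamp)
    if (bytesSinceLastIndexEntry > indexIntervalBytes) {
      // Both index entries are inserted for the same message, so the two indexes
      // stay aligned and no separate physical-position tracking is needed.
      offsetIndex += ((offset, physicalPosition))
      timeIndex   += ((maxTimestampSoFar, offset))
      bytesSinceLastIndexEntry = 0
    }
    bytesSinceLastIndexEntry += messageSizeInBytes
  }
}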

10. Yes, it should only happen after a hard failure. The last time index
entry of a normally closed segment already points to the LEO, so there
is no scan during startup.

11. On broker startup, if a time index does not exist, an empty one will be
created first. If the message format version is 0.9.0, we will append a time
index entry of (last modification time -> base offset of next segment) to
the time index of each inactive segment, so no actual rebuild will happen
during upgrade. However, if the message format version is 0.10.0, we will
rebuild the time index if it does not exist. (I actually had a question
about how we are loading the log segments; we can discuss it in the PR.)
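
Put as a tiny decision sketch (illustrative only; the names below are made up
and not the patch's code), for a segment whose time index file is missing on
startup:

object StartupTimeIndexDecision {
  sealed trait Action
  case object AppendLastModifiedEntry extends Action // (last modification time -> next base offset)
  case object RebuildByScanningLog    extends Action

  def actionForMissingTimeIndex(messageFormatIsPre0100: Boolean): Action =
    if (messageFormatIsPre0100) AppendLastModifiedEntry
    else RebuildByScanningLog
}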

I will update the wiki to clarify the questions raised in the comments and
submit a PR by tomorrow. I am currently cleaning up the documentation.

Thanks,

Jiangjie (Becket) Qin



On Sun, Apr 10, 2016 at 9:25 PM, Jun Rao <ju...@confluent.io> wrote:

> Hi, Jiangjie,
>
> Thanks for the update. Looks good to me overall. Just a few minor comments
> below.
>
> 10. On broker startup, it's not clear to me why we need to scan the log
> segment to retrieve the largest timestamp since the time index always has
> an entry for the largest timestamp. Is that only for restarting after a
> hard failure?
>
> 11. On broker startup, if a log segment misses the time index, do we always
> rebuild it? This can happen when the broker is upgraded.
>
> 12. Related to Guozhang's question #1. It seems it's simpler to add time
> index entries independent of the offset index since at index entry may not
> be added to the offset and the time index at the same time. Also, this
> allows time index to be rebuilt independently if needed.
>
> Thanks,
>
> Jun
>
>
> On Wed, Apr 6, 2016 at 5:44 PM, Becket Qin <be...@gmail.com> wrote:
>
> > Hi all,
> >
> > I updated KIP-33 based on the initial implementation. Per discussion on
> > yesterday's KIP hangout, I would like to initiate the new vote thread for
> > KIP-33.
> >
> > The KIP wiki:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> >
> > Here is a brief summary of the KIP:
> > 1. We propose to add a time index for each log segment.
> > 2. The time indices are going to be used of log retention, log rolling
> and
> > message search by timestamp.
> >
> > There was an old voting thread which has some discussions on this KIP.
> The
> > mail thread link is following:
> >
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
> >
> > I have the following WIP patch for reference. It needs a few more unit
> > tests and documentation. Other than that it should run fine.
> >
> >
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
>

Re: [VOTE] KIP-33 - Add a time based log index

Posted by Jun Rao <ju...@confluent.io>.
Hi, Jiangjie,

Thanks for the update. Looks good to me overall. Just a few minor comments
below.

10. On broker startup, it's not clear to me why we need to scan the log
segment to retrieve the largest timestamp since the time index always has
an entry for the largest timestamp. Is that only for restarting after a
hard failure?

11. On broker startup, if a log segment is missing the time index, do we always
rebuild it? This can happen when the broker is upgraded.

12. Related to Guozhang's question #1: it seems simpler to add time
index entries independently of the offset index, since an index entry may not
be added to the offset index and the time index at the same time. Also, this
allows the time index to be rebuilt independently if needed.
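
For example (a sketch only, with made-up names; not actual code), the time
index could keep its own byte accumulator so its entries and rebuilds are
decoupled from the offset index:

import scala.collection.mutable.ArrayBuffer

class IndependentIndexSketch(indexIntervalBytes: Int) {
  private var bytesSinceLastOffsetIndexEntry = 0
  private var bytesSinceLastTimeIndexEntry   = 0
  private var maxTimestampSoFar = -1L

  val offsetIndex = ArrayBuffer.empty[(Long, Int)]   // (offset, physical position)
  val timeIndex   = ArrayBuffer.empty[(Long, Long)]  // (max timestamp so far, offset)

  def onAppend(offset: Long, timestamp: Long, physicalPosition: Int, messageSizeInBytes: Int): Unit = {
    maxTimestampSoFar = math.max(maxTimestampSoFar, timestamp)
    if (bytesSinceLastOffsetIndexEntry > indexIntervalBytes) {
      offsetIndex += ((offset, physicalPosition))
      bytesSinceLastOffsetIndexEntry = 0
    }
    // The time index entry is added on its own schedule, not tied to the offset index.
    if (bytesSinceLastTimeIndexEntry > indexIntervalBytes) {
      timeIndex += ((maxTimestampSoFar, offset))
      bytesSinceLastTimeIndexEntry = 0
    }
    bytesSinceLastOffsetIndexEntry += messageSizeInBytes
    bytesSinceLastTimeIndexEntry   += messageSizeInBytes
  }
}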

Thanks,

Jun


On Wed, Apr 6, 2016 at 5:44 PM, Becket Qin <be...@gmail.com> wrote:

> Hi all,
>
> I updated KIP-33 based on the initial implementation. Per discussion on
> yesterday's KIP hangout, I would like to initiate the new vote thread for
> KIP-33.
>
> The KIP wiki:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
>
> Here is a brief summary of the KIP:
> 1. We propose to add a time index for each log segment.
> 2. The time indices are going to be used of log retention, log rolling and
> message search by timestamp.
>
> There was an old voting thread which has some discussions on this KIP. The
> mail thread link is following:
>
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3CCABtAgwGOEbuKYAPfPChMyCJK2Tepq3ngtUwNhtr2tJvsnC8peg@mail.gmail.com%3E
>
> I have the following WIP patch for reference. It needs a few more unit
> tests and documentation. Other than that it should run fine.
>
> https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
>
> Thanks,
>
> Jiangjie (Becket) Qin
>