Posted to dev@kafka.apache.org by Chris Burroughs <ch...@gmail.com> on 2011/07/27 03:19:10 UTC

On time/offset indexes

So for good reason [1] Kafka doesn't keep a complicated time --> offset
index.  Whatever the start and end of a log file are is what you get.  We
can approximate finer-grained time indexes with smaller log files [2]
and getOffsetsBefore, but we would really prefer not to have lots of
small files everywhere.

To solve the case of wanting time-based indexes without lots of files,
could we have another append-only companion file for each Log that
periodically (I'm thinking on the order of 1 minute) gets
timestamp:offset appended to it?  That should have low overhead, and if
the companion file is missing/deleted/etc. we can still fall back to the
current logic.
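To make the idea concrete, here is a rough sketch of what such a companion file could look like.  This is purely illustrative -- the class name, the plain-text "timestamp:offset" line format, and the fallback return value are all my own invention, not anything that exists in Kafka today:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

// Hypothetical sketch of an append-only time -> offset companion index.
// One "timestamp:offset" line is appended roughly once a minute by
// whatever drives the log's periodic flush.
public class TimeIndex {
    private final File file;

    public TimeIndex(File file) {
        this.file = file;
    }

    // Append one entry.  Entries arrive in increasing timestamp order
    // because the writer only ever moves forward in time.
    public void append(long timestampMs, long offset) throws IOException {
        try (FileWriter w = new FileWriter(file, true)) { // append mode
            w.write(timestampMs + ":" + offset + "\n");
        }
    }

    // Return the largest offset whose entry timestamp is <= targetMs,
    // or -1 if the index is missing or has no usable entry -- in which
    // case the caller falls back to the existing start/end-of-log logic.
    public long offsetBefore(long targetMs) throws IOException {
        if (!file.exists()) {
            return -1;
        }
        long best = -1;
        try (BufferedReader r = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = r.readLine()) != null) {
                String[] parts = line.split(":");
                long ts = Long.parseLong(parts[0]);
                if (ts <= targetMs) {
                    best = Long.parseLong(parts[1]);
                } else {
                    break; // entries are in time order; nothing later matches
                }
            }
        }
        return best;
    }
}
```

With one entry per minute a linear scan like this is cheap; fixed-width binary entries plus a binary search would be the obvious refinement if the files ever got big.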

[1] "Furthermore the complexity of maintaining the mapping from a random
id to an offset requires a heavy weight index structure which must be
synchronized with disk, essentially requiring a full persistent
random-access data structure." http://sna-projects.com/kafka/design.php

[2] And KAFKA-40 would make this easier to do.

Re: On time/offset indexes

Posted by Chris Burroughs <ch...@gmail.com>.
- Per partition or segment.

I think per segment is more useful and easier.  If it's per segment we
can just delete the index at the same time as the segment.  For per
partition I think we would have to do something other than append to a
file.

- Use cases.

One minute is the finest resolution I could ever see using for web access
logs, verbose gc, syslog-type data. (I'd be interested in hearing use
cases for finer resolution.) I agree it should be configurable.

- Low volume.

If the data volume is that low I suspect the extra seek to read the
index wouldn't be worth it.  Maybe there is a not too clever way to only
update the index if new data is flowing in?
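One not-too-clever approach (again a hypothetical sketch, not existing code): have the periodic writer remember the last offset it indexed and skip the append whenever the log's end offset hasn't moved, so a quiet log never accumulates duplicate entries:

```java
// Hypothetical sketch: suppress the periodic index append when no new
// data has arrived since the previous entry, so low-volume logs don't
// fill the index with entries pointing at the same offset.
public class DedupingIndexWriter {
    private long lastIndexedOffset = -1;

    // Called once per index interval with the log's current end offset.
    // Returns true only if an entry should actually be written.
    public boolean shouldAppend(long currentEndOffset) {
        if (currentEndOffset == lastIndexedOffset) {
            return false; // log is idle; skip this tick
        }
        lastIndexedOffset = currentEndOffset;
        return true;
    }
}
```

This keeps the index append-only while bounding its size by the amount of actual data flow rather than by wall-clock time.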

- Should there be an option to turn it off?

Sure. If you never "rewind the queue" or start from a time other than
now, the index would be useless to you.

On 07/27/2011 11:51 AM, Jun Rao wrote:
> Adding a separate index file is possible. Will there be 1 index file per
> partition or per segment? Is 1 minute interval good enough for typical use
> cases? Should we make the interval configurable? One downside of this is
> that for low volume data, there will be lots of entries pointing to the same
> offset. May be we should make the index optional?
> 
> Jun
> 
> On Tue, Jul 26, 2011 at 6:19 PM, Chris Burroughs
> <ch...@gmail.com>wrote:
> 
>> So for good reason [1] Kafka doesn't keep a complicated time --> offset
>> index.  Whatever is the start and end of log file is what you get.  We
>> can approximate finer grained time indexes with smaller log files [2]
>> and getOffsetsBefore, but we would really prefer not to have lots of
>> small files everywhere.
>>
>> To solve the case of wanting time based indexes without lots of files
>> could we have another append only companion file for each Log that
>> periodically (I'm thinking on the order of 1 minute) gets
>> timestamp:offset appended to it?  That should have low overhead and if
>> the companion file is missing/deleted/etc we can still use the current
>> logic.
>>
>> [1] "Furthermore the complexity of maintaining the mapping from a random
>> id to an offset requires a heavy weight index structure which must be
>> synchronized with disk, essentially requiring a full persistent
>> random-access data structure. " http://sna-projects.com/kafka/design.php
>>
>> [2] And KAFKA-40 would make this easier to do.
>>
> 


Re: On time/offset indexes

Posted by Jun Rao <ju...@gmail.com>.
Adding a separate index file is possible. Will there be 1 index file per
partition or per segment? Is a 1-minute interval good enough for typical
use cases? Should we make the interval configurable? One downside of this
is that for low-volume data, there will be lots of entries pointing to
the same offset. Maybe we should make the index optional?

Jun

On Tue, Jul 26, 2011 at 6:19 PM, Chris Burroughs
<ch...@gmail.com>wrote:

> So for good reason [1] Kafka doesn't keep a complicated time --> offset
> index.  Whatever is the start and end of log file is what you get.  We
> can approximate finer grained time indexes with smaller log files [2]
> and getOffsetsBefore, but we would really prefer not to have lots of
> small files everywhere.
>
> To solve the case of wanting time based indexes without lots of files
> could we have another append only companion file for each Log that
> periodically (I'm thinking on the order of 1 minute) gets
> timestamp:offset appended to it?  That should have low overhead and if
> the companion file is missing/deleted/etc we can still use the current
> logic.
>
> [1] "Furthermore the complexity of maintaining the mapping from a random
> id to an offset requires a heavy weight index structure which must be
> synchronized with disk, essentially requiring a full persistent
> random-access data structure. " http://sna-projects.com/kafka/design.php
>
> [2] And KAFKA-40 would make this easier to do.
>