Posted to users@kafka.apache.org by S Ahmed <sa...@gmail.com> on 2012/06/13 15:40:05 UTC

random access

I was thinking of replicating messages to a central location and giving the
messages a very long expiry (say, 1 year).

My requirement is to not just stream messages but also access messages by
key, similar to "SELECT * FROM TABLE WHERE id=123".

From what I understand, there is currently no index file that maps messages
to their exact location in a file, correct?  I.e. Kafka streams the messages:
it goes to a .kafka file, starts from the beginning, and streams the data to
a consumer.  If your offset happens to be in the middle of the file, it will
scan the file: start at the beginning of a message, figure out the length of
that message, and then jump to the position of the next message until it
finds the correct message offset.  Is this correct?

i.e. I would have to create some sort of index that maps the offset to the
'messageId' (where the messageId is stored in the body of the message
itself).

Re: random access

Posted by Jay Kreps <ja...@gmail.com>.

If the access is by offset then there will be one seek (if the data doesn't
fit in memory) or no seeks (if it is cached). The pagecache will
automatically fill all free memory on the machine. If the access is by some
secondary index of key=>offset that you maintain then it will depend on the
efficiency of your index.

-Jay
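
A minimal sketch of the key=>offset secondary index mentioned above, assuming
the messageId sits in a JSON message body. The names build_index and records
are hypothetical, and the (offset, payload) pairs stand in for whatever a
consumer of the central cluster would yield; no Kafka client API is used here.

```python
import json


def build_index(records):
    """Map each messageId to the offset of the message that carries it.

    `records` is any iterable of (offset, payload_bytes) pairs.
    """
    index = {}
    for offset, payload in records:
        body = json.loads(payload)
        # A later message with the same messageId overwrites the earlier entry.
        index[body["messageId"]] = offset
    return index


# Stand-in for consumed messages; the offsets are made up for illustration.
records = [
    (0, b'{"messageId": 123, "text": "first"}'),
    (41, b'{"messageId": 456, "text": "second"}'),
]
index = build_index(records)
offset = index[123]  # the offset to pass to a fetch request
```

With a dict the lookup itself is a hash probe in memory, so the cost of a
read is dominated by the fetch at the returned offset.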

On Wed, Jun 13, 2012 at 7:49 AM, S Ahmed <sa...@gmail.com> wrote:

> So I'll just have to create one then I guess if I want to do this.  I was
> planning on doing this:
>
> prod#1 -> kafka#1 -> consumer  -> prod#2 -> kafka#2 central
>
> kafka-central will have long lasting messages.
>
> The consumer that pulls from kafka#2 will filter messages, and then I can
> create an index that maps offset to messageId.
>
> Just wondering how fast random access to a Kafka file will be, i.e. whether
> it will be as fast as a DB lookup.  It's a memory-mapped file, so it should
> be fast in theory, but things will degrade as the number of files grows.

Re: random access

Posted by S Ahmed <sa...@gmail.com>.

So I'll just have to create one then I guess if I want to do this.  I was
planning on doing this:

prod#1 -> kafka#1 -> consumer  -> prod#2 -> kafka#2 central

kafka-central will have long lasting messages.

The consumer that pulls from kafka#2 will filter messages, and then I can
create an index that maps offset to messageId.

Just wondering how fast random access to a Kafka file will be, i.e. whether
it will be as fast as a DB lookup.  It's a memory-mapped file, so it should
be fast in theory, but things will degrade as the number of files grows.
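
One way the index described here could be kept across restarts is a small
on-disk table. This is only a sketch under assumptions: the table name
message_index, the column names, and the helpers record and lookup are all
hypothetical, and SQLite is just one possible store, not something the thread
prescribes.

```python
import sqlite3

# ":memory:" keeps the example self-contained; a file path would persist the index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS message_index ("
    "message_id INTEGER PRIMARY KEY, log_offset INTEGER NOT NULL)"
)


def record(conn, message_id, log_offset):
    """Remember where a message lives; a repeated messageId replaces the old row."""
    conn.execute(
        "INSERT OR REPLACE INTO message_index (message_id, log_offset) VALUES (?, ?)",
        (message_id, log_offset),
    )


def lookup(conn, message_id):
    """Return the stored offset for a messageId, or None if it was never indexed."""
    row = conn.execute(
        "SELECT log_offset FROM message_index WHERE message_id = ?",
        (message_id,),
    ).fetchone()
    return row[0] if row else None


# The consumer would call record() for every message that passes the filter.
record(conn, 123, 0)
record(conn, 456, 41)
conn.commit()
```

A read then costs one indexed lookup in this table plus one fetch from the
log at the returned offset.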

On Wed, Jun 13, 2012 at 10:01 AM, Jay Kreps <ja...@gmail.com> wrote:

> There is no scanning; we compute the message location from the offset and
> begin fetching there.

Re: random access

Posted by Jay Kreps <ja...@gmail.com>.

There is no scanning; we compute the message location from the offset and begin fetching there.
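
A rough sketch of how a location can be computed from an offset without
scanning. It assumes that an offset is a byte position in the partition log
and that each segment file is identified by the offset of its first message;
that layout is an assumption for illustration, not something stated in this
thread. The names locate and segment_starts are hypothetical.

```python
import bisect


def locate(segment_starts, offset):
    """Return (segment_start, position_in_segment) for a log offset.

    `segment_starts` is a sorted list holding the first offset of each segment.
    """
    # Find the last segment whose starting offset is <= the requested offset.
    i = bisect.bisect_right(segment_starts, offset) - 1
    if i < 0:
        raise ValueError("offset precedes the oldest retained segment")
    start = segment_starts[i]
    # Under the byte-offset assumption, the position in the file is a subtraction.
    return start, offset - start


segment_starts = [0, 1000, 2500]  # made-up segment boundaries
segment, position = locate(segment_starts, 1234)
```

The binary search is over the list of segments, not over messages, so the
cost grows with the logarithm of the number of segment files.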
