Posted to dev@kafka.apache.org by S Ahmed <sa...@gmail.com> on 2012/11/26 04:54:40 UTC

understanding partitions based on wiki example of profile visits

The wiki states "Consider an application that would like to maintain an
aggregation of the number of profile visitors for each member. It would
like to send all profile visit events for a member to a particular
partition and, hence, have all updates for a member to appear in the same
stream for the same consumer thread." (
http://incubator.apache.org/kafka/design.html)

So say I have 5 broker servers. My producer will send a message for a
particular profile page visit, with the default algorithm using
hash(member_id)%num_partitions
to figure out which broker server to send it to.

So a particular member's pageview messages will all go to a single server
then, is this the case?  And therefore all the messages for a given user
will also be in the correct order, right?
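
[Editor's sketch] The formula in the question picks a partition, and each
partition in turn lives on one broker. The snippet below is a minimal
illustration under stated assumptions: CRC32 stands in for whatever hash
the real default partitioner uses, the partition count of 20 is made up,
and broker_for is a hypothetical round-robin placement rather than Kafka's
actual assignment logic.

```python
import zlib


def partition_for(member_id: str, num_partitions: int) -> int:
    """Map a key to a partition index with a stable hash.

    CRC32 is only a stand-in; the real partitioner may hash differently.
    """
    return zlib.crc32(member_id.encode("utf-8")) % num_partitions


def broker_for(partition: int, num_brokers: int) -> int:
    """Hypothetical placement: spread partitions over brokers round-robin."""
    return partition % num_brokers


NUM_PARTITIONS = 20  # assumed value, not from the thread
NUM_BROKERS = 5      # the broker count used in the question

p = partition_for("member-42", NUM_PARTITIONS)
b = broker_for(p, NUM_BROKERS)
print(f"member-42 -> partition {p} on broker {b}")
```

Because the hash is stable, the same member_id always maps to the same
partition, and therefore to the same broker, for as long as the partition
count stays fixed.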

So a consumer group that subscribes to the 'profile-page-view' topic will
consume page view related messages. Is it possible to subscribe to a
particular broker partition also?

Are broker partitions meant for cases when you want all messages to be
saved on the same node?

Re: understanding partitions based on wiki example of profile visits

Posted by Jay Kreps <ja...@gmail.com>.
We don't have a partition per user; there is no need for that, in the same
way a distributed database doesn't have a partition per user. A partition
is just a physical grouping of keys.
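
[Editor's sketch] To make "a physical grouping of keys" concrete, the
snippet below hashes 100,000 made-up member ids into a small, fixed set of
partitions. The partition count of 16 and the CRC32 hash are assumptions
for illustration only.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 16  # assumed value; fixed regardless of how many users exist


def partition_for(member_id: int, num_partitions: int) -> int:
    """Stable hash of the key, reduced to a partition index (CRC32 stand-in)."""
    return zlib.crc32(str(member_id).encode("utf-8")) % num_partitions


# Count how many of the 100,000 hypothetical members land in each partition.
counts = Counter(partition_for(m, NUM_PARTITIONS) for m in range(100_000))

print(f"{sum(counts.values())} members share {len(counts)} partitions")
```

The number of logs on disk follows the number of partitions, never the
number of users, so adding members only makes each group of keys larger.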

-Jay


On Tue, Nov 27, 2012 at 12:00 PM, S Ahmed <sa...@gmail.com> wrote:

> How does that work out though? I mean, with 10 million users that is 10
> million files at least.

Re: understanding partitions based on wiki example of profile visits

Posted by S Ahmed <sa...@gmail.com>.
How does that work out though? I mean, with 10 million users that is 10
million files at least.


On Mon, Nov 26, 2012 at 2:02 PM, Jay Kreps <ja...@gmail.com> wrote:

> Yeah a partition is physically implemented as a log (i.e. a sequence of
> files containing a bunch of messages indexed by offset). So each server can
> have lots of partitions, but each partition exists entirely on a server.
>
> So in the "newsfeed" case if you partition by user id, you would be
> guaranteed that all activity relevant to that user went to a single
> processor. In our case, yes, we serve out of a different system which is
> the destination after all the pre-processing.

Re: understanding partitions based on wiki example of profile visits

Posted by Jay Kreps <ja...@gmail.com>.
Yeah a partition is physically implemented as a log (i.e. a sequence of
files containing a bunch of messages indexed by offset). So each server can
have lots of partitions, but each partition exists entirely on a server.
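
[Editor's sketch] The "log indexed by offset" idea can be shown with an
in-memory stand-in. PartitionLog is a hypothetical class, not a Kafka API,
and it ignores the on-disk detail that a real log is split across a
sequence of files.

```python
class PartitionLog:
    """In-memory stand-in for one partition: append-only, indexed by offset."""

    def __init__(self) -> None:
        self._messages: list[bytes] = []

    def append(self, message: bytes) -> int:
        """Add a message at the end of the log and return its offset."""
        offset = len(self._messages)
        self._messages.append(message)
        return offset

    def read(self, offset: int, max_messages: int = 10) -> list[bytes]:
        """Return up to max_messages messages starting at the given offset."""
        return self._messages[offset:offset + max_messages]


log = PartitionLog()
first = log.append(b"visit:member-42")
second = log.append(b"visit:member-7")

# A reader keeps track of its own position and asks for messages from there.
print(log.read(first))
```

A server can hold many such logs, one per partition it hosts, while any
single log stays whole on one server.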

So in the "newsfeed" case if you partition by user id, you would be
guaranteed that all activity relevant to that user went to a single
processor. In our case, yes, we serve out of a different system which is
the destination after all the pre-processing.


On Mon, Nov 26, 2012 at 9:19 AM, S Ahmed <sa...@gmail.com> wrote:

> >Yes, your description is correct. A particular member's data would all be
> >in one partition.
> When you say in one partition, does that also mean on the same server?  Or
> can a partition span broker nodes?
>
> At the file level, I'm guessing it has its own physical file then? (or set
> of files as it grows with the file number suffix).
>
> So at LinkedIn, is this how you present a user's dashboard inbox (your
> friend has a new job, they updated their profile, someone recommended them,
> etc.)?  I guess you can further sort at the application level then, and
> cache to a different store?

Re: understanding partitions based on wiki example of profile visits

Posted by S Ahmed <sa...@gmail.com>.
>Yes, your description is correct. A particular member's data would all be
>in one partition.
When you say in one partition, does that also mean on the same server?  Or
can a partition span broker nodes?

At the file level, I'm guessing it has its own physical file then? (or set
of files as it grows with the file number suffix).

So at LinkedIn, is this how you present a user's dashboard inbox (your
friend has a new job, they updated their profile, someone recommended them,
etc.)?  I guess you can further sort at the application level then, and
cache to a different store?


On Mon, Nov 26, 2012 at 11:53 AM, Jay Kreps <ja...@gmail.com> wrote:

> Yes, your description is correct. A particular member's data would all be
> in one partition.
>
> Broker partitions are just the unit of parallelism--think of each partition
> as a totally ordered log you can append to and read from. The consumption
> of one of these partition logs is single threaded.
>
> The guarantee is that all messages are added to a partition in the order
> they arrive. From the point of view of a single producer client this will
> also be the order in which they are sent. These messages are then delivered
> in this order to a consumer thread.
>
> Hope that helps.
>
> -Jay

Re: understanding partitions based on wiki example of profile visits

Posted by Jay Kreps <ja...@gmail.com>.
Yes, your description is correct. A particular member's data would all be
in one partition.

Broker partitions are just the unit of parallelism--think of each partition
as a totally ordered log you can append to and read from. The consumption
of one of these partition logs is single threaded.

The guarantee is that all messages are added to a partition in the order
they arrive. From the point of view of a single producer client this will
also be the order in which they are sent. These messages are then delivered
in this order to a consumer thread.
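
[Editor's sketch] The ordering guarantee can be illustrated with plain
lists standing in for partition logs. The member names, sequence numbers,
partition count, and CRC32 hash are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed value


def partition_for(member_id: str) -> int:
    """Stable hash of the key, reduced to a partition index (CRC32 stand-in)."""
    return zlib.crc32(member_id.encode("utf-8")) % NUM_PARTITIONS


# One list per partition; appending keeps messages in arrival order.
partitions: dict[int, list[tuple[str, int]]] = defaultdict(list)

# A single producer sends visit events; the number is a per-member sequence.
sent = [("alice", 1), ("bob", 1), ("alice", 2),
        ("carol", 1), ("bob", 2), ("alice", 3)]
for member_id, seq in sent:
    partitions[partition_for(member_id)].append((member_id, seq))

# Each partition is read front to back by one consumer thread.
seen: dict[str, list[int]] = defaultdict(list)
for partition_log in partitions.values():
    for member_id, seq in partition_log:
        seen[member_id].append(seq)

print(dict(seen))
```

Every member's events come out in the order they were sent because they
all sit in one log. Nothing here orders events across different
partitions, which matches the per-partition scope of the guarantee.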

Hope that helps.

-Jay




On Sun, Nov 25, 2012 at 7:54 PM, S Ahmed <sa...@gmail.com> wrote:

> The wiki states "Consider an application that would like to maintain an
> aggregation of the number of profile visitors for each member. It would
> like to send all profile visit events for a member to a particular
> partition and, hence, have all updates for a member to appear in the same
> stream for the same consumer thread." (
> http://incubator.apache.org/kafka/design.html)
>
> So say I have 5 broker servers. My producer will send a message for a
> particular profile page visit, with the default algorithm using
> hash(member_id)%num_partitions
> to figure out which broker server to send it to.
>
> So a particular member's pageview messages will all go to a single server
> then, is this the case?  And therefore all the messages for a given user
> will also be in the correct order, right?
>
> So a consumer group that subscribes to the 'profile-page-view' topic will
> consume page view related messages, is it possible to subscribe to a
> particular broker partition also?
>
> Are broker partitions meant for cases when you want all messages to be
> saved on the same node?
>

Re: understanding partitions based on wiki example of profile visits

Posted by S Ahmed <sa...@gmail.com>.
sorry wrong list.


On Sun, Nov 25, 2012 at 10:54 PM, S Ahmed <sa...@gmail.com> wrote:

> The wiki states "Consider an application that would like to maintain an
> aggregation of the number of profile visitors for each member. It would
> like to send all profile visit events for a member to a particular
> partition and, hence, have all updates for a member to appear in the same
> stream for the same consumer thread." (
> http://incubator.apache.org/kafka/design.html)
>
> So say I have 5 broker servers. My producer will send a message for a
> particular profile page visit, with the default algorithm using
> hash(member_id)%num_partitions
> to figure out which broker server to send it to.
>
> So a particular member's pageview messages will all go to a single server
> then, is this the case?  And therefore all the messages for a given user
> will also be in the correct order, right?
>
> So a consumer group that subscribes to the 'profile-page-view' topic will
> consume page view related messages, is it possible to subscribe to a
> particular broker partition also?
>
> Are broker partitions meant for cases when you want all messages to be
> saved on the same node?
>
