Posted to users@kafka.apache.org by Gary Ogden <go...@gmail.com> on 2015/02/11 21:13:31 UTC

understanding partition key

I'm trying to understand how the partition key works and whether I need to
specify a partition key for my topics or not. What happens if I don't
specify a PK and I have more than one consumer that wants all messages in a
topic for a certain period of time? Will those consumers still get all the
messages, just possibly not in the correct order?

The current scenario is that we will have events going into a topic based
on customer and the data will remain in the topic for 24 hours. We will
then have multiple consumers reading messages from that topic. They will
want to be able to get them out over a time range (could be last hour, last
8 hours etc).

So if I specify the same PK for each subscriber, then each consumer will
get all messages in the correct order? If I don't specify the PK, or use a
random one, will each consumer still get all the messages, just not ordered
correctly?
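
(For reference: with the 0.8-era producer API the key travels with each
message, roughly as in the sketch below. The broker address, topic, and
payloads are made up; this only shows where the key goes, not what the
consumers end up seeing.)

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    public class KeyedSend {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("metadata.broker.list", "broker1:9092"); // illustrative
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));

            // Keyed: the partitioner maps "customer-42" to a fixed partition,
            // so this customer's events keep their order relative to each other.
            producer.send(new KeyedMessage<String, String>("events", "customer-42", "event 1"));

            // Unkeyed: the producer chooses the partition itself, so there is
            // no per-customer ordering guarantee across partitions.
            producer.send(new KeyedMessage<String, String>("events", "event 2"));

            producer.close();
        }
    }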

Re: understanding partition key

Posted by David McNelis <dm...@emergingthreats.net>.
Sorry... I probably didn't word that well. When I say each message, that's
assuming you're going out to poll for new messages on the topic. I.e., if
you have a latency tolerance of 1 second, then you'd never want to go more
than 1 second without polling for new messages.
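
(With the 0.8-era high-level consumer, one way to keep that polling cadence
is the consumer.timeout.ms setting, which makes the stream iterator throw
instead of blocking forever. A minimal sketch; the topic, group, and
process() helper are made up:)

    import java.util.Collections;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.ConsumerTimeoutException;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class BoundedPoll {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // illustrative
            props.put("group.id", "alarm-processor");
            props.put("consumer.timeout.ms", "1000");         // the 1-second tolerance

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            KafkaStream<byte[], byte[]> stream = connector
                .createMessageStreams(Collections.singletonMap("events", 1))
                .get("events").get(0);

            ConsumerIterator<byte[], byte[]> it = stream.iterator();
            while (true) {
                try {
                    while (it.hasNext())
                        process(it.next().message()); // handle each record as it arrives
                } catch (ConsumerTimeoutException e) {
                    // nothing arrived in the last second; loop and wait again
                }
            }
        }

        static void process(byte[] message) { /* hypothetical handler */ }
    }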

Re: understanding partition key

Posted by Gary Ogden <go...@gmail.com>.
Thanks.

I was under the impression that you can't process the data as each record
comes into the topic, and that Kafka doesn't support pushing messages to a
consumer the way a traditional message queue does?

Re: understanding partition key

Posted by David McNelis <dm...@emergingthreats.net>.
I'm going to go a bit in reverse for your questions. We built a RESTful API
to push data to, so that we can submit things from multiple sources that
aren't necessarily things our team would maintain, and validate the data
before we send it off to a topic.

As for consumers... we expect that a consumer will receive everything in a
particular partition, but definitely not everything in a topic. For
example, for DNS information, a consumer expects to see everything for a
domain (our partition key there), but not for all domains... that would be
too much data to handle with low enough latency.

For us, latencies vary with what we're processing at a given moment and
with other large jobs that might be running outside of the stream
processing. Generally speaking, we keep latency within our limits by
adjusting the number of partitions for a topic, and subsequently the number
of consumers... which works until you hit a wall with your data store, in
which case tweaking must happen there too.

Your flow of data, along with batch size, will have the biggest impact on
latency. If you see 100 events a minute and have a batch size of 1k, you'll
have 10 minutes of latency; if you process your data as each record comes
in, then as long as your consumer can keep up with that load, you'll have
very low latency.

Kafka's latency has never been an issue for us; we've always found it's
something we're doing that affects the overall throughput of the systems,
be it needing to play with the number of partitions, adjusting batch size,
or resizing the Hadoop cluster to meet increased need there.
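
(The batch-size arithmetic above, as a back-of-the-envelope sketch with the
numbers from this message:)

    public class BatchLatency {
        public static void main(String[] args) {
            double eventsPerMinute = 100.0; // inbound rate from the example
            int batchSize = 1000;           // records per processing batch
            // The first record in a batch sits until the whole batch fills,
            // so the worst-case delay before anything is processed is:
            double worstCaseMinutes = batchSize / eventsPerMinute; // = 10.0
            System.out.printf("worst-case batching delay: %.1f minutes%n",
                worstCaseMinutes);
        }
    }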

Re: understanding partition key

Posted by Gary Ogden <go...@gmail.com>.
Thanks again David. So what kind of latencies are you experiencing with
this? If I wanted to act upon certain events in this and send out alarms
(email, sms etc), what kind of delays are you seeing by the time you're
able to process them?

It seems if you were to create an alarm topic, and dump alerts on there to
be processed, and have a consumer then process those alerts (save to
Cassandra and send out the notification), you could see some sort of delay.

But none of your consumers are expecting that they will be getting all the
data for that topic, are they? That's the hitch for me. I think I could
rework the design so that no consumer would assume it's getting all the
messages in the topic.

When you say central producer stack, this is something you built outside of
kafka, I assume.

Re: understanding partition key

Posted by David McNelis <dm...@emergingthreats.net>.
In our setup we deal with a similar situation (lots of time-series data
that we have to aggregate in a number of different ways).

Our approach is to push all of our data to a central producer stack, that
stack then submits data to different topics, depending on a set of
predetermined rules.

For argument's sake, let's say we see 100 events/second... Once the data is
pushed to topics we have several consumer applications. The first pushes
the lowest-level raw data into our cluster (in our case HBase, but it could
just as easily be C*). For this first consumer we use a small batch size...
say 100 records.

Our second consumer does some aggregation and summarization and needs to
reference additional resources.  This batch size is considerably larger,
say 1000 records in size.

Our third consumer does a much larger aggregation where we're reducing it
down to a much smaller dataset, by orders of magnitude, and we'll run that
batch size at 10k records.

Our aggregations are oftentimes time based, like rolling events / counters
to the hour or day level... so while we may occasionally re-process some of
the same information throughout the day, it lets us maintain a system where
we have near-real-time access to most of the data we're ingesting.

This is certainly something we've had to tweak, in terms of the number of
consumers/partitions and the batch sizes, to get working optimally, but
we're pretty happy with it at the moment.
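
(A minimal sketch of this kind of size-based batching with the 0.8-era
high-level consumer. The topic, group, and store() helper are made up; the
batch size is the only knob that changes between the three consumers
described above:)

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class BatchingConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // illustrative
            props.put("group.id", "raw-writer");              // one group per consumer app

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            KafkaStream<byte[], byte[]> stream = connector
                .createMessageStreams(Collections.singletonMap("events", 1))
                .get("events").get(0);

            int batchSize = 100; // 100 raw, 1000 summaries, 10k rollups, per above
            List<byte[]> batch = new ArrayList<byte[]>(batchSize);
            ConsumerIterator<byte[], byte[]> it = stream.iterator();
            while (it.hasNext()) {
                batch.add(it.next().message());
                if (batch.size() >= batchSize) {
                    store(batch); // flush to HBase / C*
                    batch.clear();
                }
            }
        }

        static void store(List<byte[]> batch) { /* hypothetical data-store write */ }
    }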

Re: understanding partition key

Posted by Gary Ogden <go...@gmail.com>.
Thanks David. Whether Kafka is the right choice is exactly what I'm trying
to determine.  Everything I want to do with these events is time based.
Store them in the topic for 24 hours. Read from the topics and get data for
a time period (last hour , last 8 hours etc). This reading from the topics
could happen numerous times for the same dataset.  At the end of the 24
hours, we would then want to store the events in long term storage in case
we need them in the future.

I'm thinking Kafka isn't the right fit for this use case though.  However,
we have cassandra already installed and running. Maybe we use kafka as the
in-between... So dump the events onto kafka, have consumers that process
the events and dump them into Cassandra, and then we read the time series
data we need from Cassandra and not kafka.

My concern with this method would be two things:
1 - the time delay of dumping them on the queue and then processing them
into Cassandra before the events are available.
2 - Is there any need for kafka now? Why can't the code that puts the
messages on the topic just directly insert into Cassandra then and avoid
this extra hop?
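
(A rough sketch of that in-between shape: the 0.8-era consumer feeding
Cassandra through the DataStax Java driver. The keyspace, table, and schema
are made up, and it assumes events are keyed by customer:)

    import java.util.Collections;
    import java.util.Date;
    import java.util.Properties;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class EventsToCassandra {
        public static void main(String[] args) {
            // Cassandra side (hypothetical keyspace "events", table "by_customer")
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("events");
            PreparedStatement insert = session.prepare(
                "INSERT INTO by_customer (customer, ts, payload) VALUES (?, ?, ?)");

            // Kafka side
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181"); // illustrative
            props.put("group.id", "cassandra-writer");
            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            KafkaStream<byte[], byte[]> stream = connector
                .createMessageStreams(Collections.singletonMap("events", 1))
                .get("events").get(0);

            ConsumerIterator<byte[], byte[]> it = stream.iterator();
            while (it.hasNext()) {
                MessageAndMetadata<byte[], byte[]> record = it.next();
                String customer = new String(record.key());    // assumes keyed events
                String payload  = new String(record.message());
                // The event is queryable in Cassandra as soon as this insert
                // returns, so the "extra hop" costs one consume plus one write.
                session.execute(insert.bind(customer, new Date(), payload));
            }
        }
    }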



Re: understanding partition key

Posted by David McNelis <dm...@emergingthreats.net>.
Gary,

That is certainly a valid use case.  What Zijing was saying is that, within a
single consumer group, each partition is consumed by at most one consumer;
separate consumer groups each receive the full stream.
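
A quick sketch of that rule, assuming the 0.8.x high-level consumer (the
group names here are made up):

import java.util.Properties;

public class GroupConfigs {
    // Consumers sharing a group.id divide the topic's partitions among
    // themselves: with 3 partitions and 3 consumers in "group-a", each
    // consumer owns exactly one partition (a 4th would sit idle).
    static Properties groupA() {
        Properties p = new Properties();
        p.put("zookeeper.connect", "zk1:2181"); // placeholder address
        p.put("group.id", "group-a");
        return p;
    }

    // A consumer started under a different group.id is independent and
    // receives its own full copy of every message in the topic.
    static Properties groupB() {
        Properties p = new Properties();
        p.put("zookeeper.connect", "zk1:2181");
        p.put("group.id", "group-b");
        return p;
    }
}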

I think that what it boils down to is how you want your information grouped
inside your timeframes.  For example, if you want to have everything for a
specific user, then you could use that as your partition key, ensuring that
any data for that user is processed by the same consumer (in however many
consumer applications you opt to run).
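
For example, with the 0.8.x producer a keyed send might look like this (the
broker address, topic, and payload are placeholders):

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KeyedSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer =
            new Producer<String, String>(new ProducerConfig(props));

        // The default partitioner hashes the key, so every event for
        // "user-42" lands in the same partition and is read in order by
        // whichever consumer in a group owns that partition.
        producer.send(new KeyedMessage<String, String>(
            "events", "user-42", "{\"action\":\"login\"}"));
        producer.close();
    }
}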

The second point that I think Zijing was getting at was whether or not your
proposed use case makes sense for Kafka.  If your goal is to do
time-interval batch processing (versus N-record batches), then why use
Kafka for it?  Why not use something more adept at batch processing?  For
example, if you're using HBase you can use Pig jobs that would read only
the records created between specific timestamps.
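
As a sketch of that approach with the HBase Java client (0.98-era API; the
table name is illustrative), a time-bounded read is just a Scan with a time
range:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class TimeRangeRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");

        // Return only cells whose write timestamp falls in the last hour.
        long end = System.currentTimeMillis();
        Scan scan = new Scan();
        scan.setTimeRange(end - 3600 * 1000L, end);

        ResultScanner scanner = table.getScanner(scan);
        for (Result row : scanner) {
            System.out.println(row);
        }
        scanner.close();
        table.close();
    }
}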

David

On Thu, Feb 12, 2015 at 7:44 AM, Gary Ogden <go...@gmail.com> wrote:

> So it's not possible to have 1 topic with 1 partition and many consumers of
> that topic?
>
> My intention is to have a topic with many consumers, but each consumer
> needs to be able to have access to all the messages in that topic.
>

Re: understanding partition key

Posted by Gary Ogden <go...@gmail.com>.
So it's not possible to have 1 topic with 1 partition and many consumers of
that topic?

My intention is to have a topic with many consumers, but each consumer
needs to be able to have access to all the messages in that topic.

On 11 February 2015 at 20:42, Zijing Guo <al...@yahoo.com.invalid> wrote:

> Partition key operates at the producer level: if you have multiple
> partitions for a single topic, you can pass a key into the KeyedMessage
> object, and based on the configured partitioner.class, it will return a
> partition number to the producer, and the producer will find the leader
> for that partition.
>
> I don't know how Kafka could handle the time series case; it depends on
> how many partitions the topic has. If you only have 1 partition, then you
> don't need to worry about order at all, since each consumer group will
> only allow 1 consumer instance to consume that data. If you have multiple
> partitions (say 3, for example), then you can fire up 3 consumer instances
> under the same consumer group, and each will consume only 1 partition's
> data. If order within each partition matters, then you need to do some
> work on the producer side.
>
> Hope this helps,
> Edwin

Re: understanding partition key

Posted by Zijing Guo <al...@yahoo.com.INVALID>.
Partition key operates at the producer level: if you have multiple partitions
for a single topic, you can pass a key into the KeyedMessage object, and based
on the configured partitioner.class, it will return a partition number to the
producer, and the producer will find the leader for that partition.

I don't know how Kafka could handle the time series case; it depends on how
many partitions the topic has. If you only have 1 partition, then you don't
need to worry about order at all, since each consumer group will only allow 1
consumer instance to consume that data. If you have multiple partitions (say
3, for example), then you can fire up 3 consumer instances under the same
consumer group, and each will consume only 1 partition's data. If order within
each partition matters, then you need to do some work on the producer side.

Hope this helps,
Edwin
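
To make the partitioner mechanics concrete, here is a sketch of a custom
partitioner for the 0.8.2-era producer (the hash-and-modulo scheme is just
an example):

import kafka.producer.Partitioner;
import kafka.utils.VerifiableProperties;

public class CustomerPartitioner implements Partitioner {

    // The producer instantiates partitioners reflectively and expects a
    // constructor that takes VerifiableProperties.
    public CustomerPartitioner(VerifiableProperties props) {
    }

    // Map the message key to a partition number; the producer then looks
    // up the leader broker for that partition and sends the message there.
    @Override
    public int partition(Object key, int numPartitions) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}

It would be wired in on the producer config with something like
props.put("partitioner.class", "CustomerPartitioner").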

     On Wednesday, February 11, 2015 3:14 PM, Gary Ogden <go...@gmail.com> wrote:
   

 I'm trying to understand how the partition key works and whether I need to
specify a partition key for my topics or not.  What happens if I don't
specify a PK and I have more than one consumer that wants all messages in a
topic for a certain period of time? Will those consumers get all the
messages, but they just may not be ordered correctly?

The current scenario is that we will have events going into a topic based
on customer and the data will remain in the topic for 24 hours. We will
then have multiple consumers reading messages from that topic. They will
want to be able to get them out over a time range (could be last hour, last
8 hours etc).

So if I specify the same PK for each subscriber, then each consumer will
get all messages in the correct order?  If I don't specify the PK or use a
random one, will each consumer still get all the messages but they just
won't be ordered correctly?