Posted to dev@samza.apache.org by Dotan Patrich <do...@fortscale.com> on 2015/02/26 07:09:38 UTC

Samza checkpoints in ZK

Hi,

I was looking for a quick and easy way to monitor task offsets and
stumbled upon this utility: https://github.com/quantifind/KafkaOffsetMonitor

It didn't work for me, and what I discovered is that they apparently look
for the consumer list and offsets in ZooKeeper, while Samza stores those
in a Kafka topic.
I tried to think of what the downsides of using ZooKeeper to store
offsets could be (performance?), but nothing solid came to mind.

I guess you guys have had some discussions about this in the past. What
are the pros and cons of storing the offsets in a Kafka topic instead of
ZooKeeper?


Thanks,
Dotan

Re: Samza checkpoints in ZK

Posted by Chris Riccomini <cr...@apache.org>.
Hey Dotan,

Samza has the checkpoint-tool.sh script, which can be used to read
checkpoints for a given task. The MetricsSnapshotReporter can also be used
to read metrics from a Samza job to check its offset progress.

I don't believe that there's anything on the open source side that's plug
and play, but you could hook up Samza to a metrics system (Graphite, etc.)
and do metrics/monitoring that way.

A simple hack is to use MetricsSnapshotReporter, then run
kafka-console-consumer.sh to read the JSON blobs, and parse the metrics
that way.
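
As a rough sketch of that hack, the Python below pulls offset-related
metrics out of each JSON blob. It assumes each line is one snapshot shaped
like {"header": {...}, "metrics": {group: {name: value}}} and that offset
metrics have "offset" somewhere in their name. Both are assumptions to check
against what your job actually emits, and the function names are made up
for illustration.

```python
import json


def offset_metrics(snapshot):
    """Return {(group, name): value} for metrics whose name mentions 'offset'.

    Assumes snapshot["metrics"] maps group -> {metric name -> value}.
    """
    found = {}
    for group, metrics in snapshot.get("metrics", {}).items():
        for name, value in metrics.items():
            if "offset" in name.lower():
                found[(group, name)] = value
    return found


def print_offsets(lines):
    """Print one tab-separated row per offset metric found in `lines`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            snapshot = json.loads(line)
        except ValueError:
            continue  # not a JSON blob, skip it
        if not isinstance(snapshot, dict):
            continue
        # "source" as a header field is an assumption about the blob layout.
        source = snapshot.get("header", {}).get("source", "unknown")
        for (group, name), value in sorted(offset_metrics(snapshot).items()):
            print(f"{source}\t{group}\t{name}\t{value}")
```

Piping the kafka-console-consumer.sh output for the metrics topic into a
script that calls print_offsets(sys.stdin) would then print one line per
offset metric.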

Cheers,
Chris

Re: Samza checkpoints in ZK

Posted by Dotan Patrich <do...@fortscale.com>.
Hi Chris,

Thanks for the info, very helpful!
That seems very reasonable. By the way, it all started when I was looking
for an open source monitoring tool for Samza/Kafka to see which tasks are
the bottleneck in terms of performance. Do you have any experience with such
a tool (other than the internal solution developed at LinkedIn)?

Re: Samza checkpoints in ZK

Posted by Chris Riccomini <cr...@apache.org>.
Hey Dotan,

The high-level (ZK-based) Kafka consumer (not Samza's) currently uses ZK to
store offsets. They (Kafka) are moving away from this when they re-write
their new NIO-based consumer. They will adopt the strategy of storing
offsets in a Kafka topic, just like Samza has for years.

The main motivation for not storing offsets in ZK is that it imposes
artificial limits on how often you can checkpoint due to ZK scalability.
For example, if you wanted to checkpoint your offsets after every message,
you would hammer away on ZK with thousands of writes per second, just for
one consumer. Multiply this out by 100s or 1000s of consumers, and the ZK
grid would never be able to keep up. Kafka is actually really good at
exactly this kind of workload. In general, using ZK as a KV store is not a
great idea.
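
To put rough numbers on that, here is a back-of-the-envelope sketch. The
function name and the figures (1,000 messages per second per consumer, 100
consumers) are purely illustrative assumptions, not measurements.

```python
def checkpoint_writes_per_second(messages_per_second, consumers,
                                 messages_per_checkpoint=1):
    """Estimate offset writes per second hitting the checkpoint store.

    messages_per_checkpoint=1 means checkpointing after every message.
    """
    return messages_per_second * consumers / messages_per_checkpoint


# Hypothetical load: 100 consumers at 1,000 messages/s each,
# checkpointing after every message.
every_message = checkpoint_writes_per_second(1000, 100)

# Same load, checkpointing once per 1,000 messages instead.
batched = checkpoint_writes_per_second(1000, 100, messages_per_checkpoint=1000)

print(every_message, batched)
```

Under these assumed figures, per-message checkpointing produces a thousand
times more writes than checkpointing once per 1,000 messages, which is the
kind of limit a ZK-backed store would force you to design around.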

The other benefit of storing offsets in Kafka is that it means Samza
doesn't directly depend on ZK (just transitively, through Kafka). This
should make operating Samza easier.

Cheers,
Chris
