Posted to users@kafka.apache.org by Eduardo Costa Alfaia <e....@unibs.it> on 2015/02/07 12:01:59 UTC

Doubts Kafka

Hi Guys,

I have some doubts about Kafka. The first is: why do applications sometimes
prefer to connect to ZooKeeper instead of the brokers? Connecting to
ZooKeeper could create overhead, because we are inserting another element
between producer and consumer. Another question is about the data sent by
the producer: in my tests the producer sends messages to the brokers and
within a few minutes my hard disk is full (it has 250 GB). Is there
something I can do in the configuration to minimize this?

Thanks 
-- 
Privacy Notice: http://www.unibs.it/node/8155

Re: Doubts Kafka

Posted by Christopher Piggott <cp...@gmail.com>.
Sorry, I should have read the release notes before I asked this question.
The answer was in there.

> Internally the implementation of the offset storage is just a compacted
> (<http://kafka.apache.org/documentation.html#compaction>) Kafka topic
> (__consumer_offsets) keyed on the consumer's group, topic, and partition.
> The offset commit request writes the offset to the compacted Kafka topic
> using the highest level of durability guarantee that Kafka provides
> (acks=-1) so that offsets are never lost in the presence of uncorrelated
> failures. Kafka maintains an in-memory view of the latest offset per
> <consumer group, topic, partition> triplet, so offset fetch requests can be
> served quickly without requiring a full scan of the compacted offsets
> topic. With this feature, consumers can checkpoint offsets very often,
> possibly per message.
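
Read literally, that in-memory view boils down to a map from a (group,
topic, partition) key to the latest committed offset. A purely illustrative
Java sketch of the idea (not the broker's actual code; the key encoding is
made up):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Purely illustrative model of what the quoted passage describes: commits
    // are keyed on (group, topic, partition) in the compacted __consumer_offsets
    // topic, and the latest offset per key is also kept in memory so that
    // offset fetches do not have to scan the topic.
    public class OffsetViewSketch {

        // key = "<group>/<topic>/<partition>", value = latest committed offset
        private final Map<String, Long> latestOffsets =
            new ConcurrentHashMap<String, Long>();

        // An offset commit appends a keyed record to the compacted topic
        // (not shown here) and updates the in-memory view.
        public void onCommit(String group, String topic, int partition, long offset) {
            latestOffsets.put(group + "/" + topic + "/" + partition, offset);
        }

        // An offset fetch is answered straight from memory.
        public Long onFetch(String group, String topic, int partition) {
            return latestOffsets.get(group + "/" + topic + "/" + partition);
        }
    }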



On Sun, Feb 8, 2015 at 9:39 AM, Christopher Piggott <cp...@gmail.com>
wrote:

> > The consumer used Zookeeper to store offsets, in 0.8.2 there's an option
> > to use Kafka itself for that (by setting *offsets.storage = kafka*).
>
> Does it still really live in ZooKeeper, with Kafka just proxying the
> requests through?
>

Re: Doubts Kafka

Posted by Gwen Shapira <gs...@cloudera.com>.
This didn't change in 0.8.2, unfortunately.

What I typically do with the high-level consumer is read messages into my
own buffer and, once I'm done processing them with no errors, clear my own
buffer, commit offsets and read more messages from Kafka.

This way, if I have errors I can re-try from my buffer. If I crash and the
buffer is gone, the consumer will re-read these messages since offsets were
not committed yet.

It would have been nice if the consumer handled this for me, but managing
the buffer is not bad.
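
A minimal sketch of this pattern with the 0.8 high-level consumer follows;
the ZooKeeper address, group id, topic and batch size are placeholders, and
auto.commit.enable is turned off so that commitOffsets() is the only thing
moving the offsets:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class BufferThenCommit {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181");  // placeholder
            props.put("group.id", "my-group");                 // placeholder
            props.put("auto.commit.enable", "false");          // commit manually

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("my-topic", 1));
            ConsumerIterator<byte[], byte[]> it =
                streams.get("my-topic").get(0).iterator();

            List<byte[]> buffer = new ArrayList<byte[]>();
            while (it.hasNext()) {
                buffer.add(it.next().message());
                if (buffer.size() >= 100) {          // arbitrary batch size
                    process(buffer);                 // retry from the buffer on errors
                    connector.commitOffsets();       // only after processing succeeded
                    buffer.clear();
                }
            }
        }

        // Stand-in for whatever the application does with a batch of messages.
        private static void process(List<byte[]> batch) { /* ... */ }
    }

If process() throws, nothing has been committed yet, so the batch can be
retried from the buffer, or re-read from Kafka after a restart.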

Gwen


Re: Doubts Kafka

Posted by Christopher Piggott <cp...@gmail.com>.
Have there been any changes with 0.8.2 in how the marker gets moved when
you use the high-level consumer?

One problem I have always had is: what if I pull something from the
stream, but then I have an error in processing it? I don't really want to
move the marker.

I would almost like the client to have a callback mechanism for processing,
so the marker only gets moved if the high-level consumer successfully
invokes my callback/processor (with no exceptions, at least).
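
A hypothetical wrapper in that spirit, built on top of the 0.8 high-level
consumer rather than provided by it (the handler interface is invented, and
the connector/iterator setup is assumed to look like the sketch earlier in
the thread, with auto.commit.enable=false):

    import kafka.consumer.ConsumerIterator;
    import kafka.javaapi.consumer.ConsumerConnector;

    // Hypothetical callback contract: throwing means "do not move the marker".
    interface MessageHandler {
        void handle(byte[] message) throws Exception;
    }

    class CallbackDrivenConsumer {
        // Assumes a connector and iterator already created, with
        // auto.commit.enable=false so nothing else commits offsets.
        static void run(ConsumerConnector connector,
                        ConsumerIterator<byte[], byte[]> it,
                        MessageHandler handler) throws Exception {
            while (it.hasNext()) {
                handler.handle(it.next().message()); // may throw; offset stays put
                connector.commitOffsets();           // moved only after success
            }
        }
    }

Committing after every message like this is only really practical once
offsets are stored in Kafka (0.8.2); doing it against ZooKeeper-backed
offsets would be very heavy on ZK.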




Re: Doubts Kafka

Posted by Gwen Shapira <gs...@cloudera.com>.
On Sun, Feb 8, 2015 at 6:39 AM, Christopher Piggott <cp...@gmail.com>
wrote:

> > The consumer used Zookeeper to store offsets, in 0.8.2 there's an option
> > to use Kafka itself for that (by setting *offsets.storage = kafka*).
>
> Does it still really live in ZooKeeper, with Kafka just proxying the
> requests through?
>
>
They don't live in Zookeeper. They live in an internal Kafka topic
(__consumer_offsets).

For migration purposes, you can set dual.commit.enabled = true and then
offsets will be stored in both Kafka and ZK, but the intention is to
migrate to 100% Kafka storage.
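
A sketch of how those two settings might be wired into a consumer; the
ZooKeeper address and group id are placeholders:

    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class OffsetStorageMigrationSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181");  // placeholder
            props.put("group.id", "my-group");                 // placeholder
            props.put("offsets.storage", "kafka");    // keep offsets in the internal topic
            props.put("dual.commit.enabled", "true"); // also commit to ZK while migrating
            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            // ... create streams and consume as usual; once every consumer in the
            // group runs with these settings, dual.commit.enabled can go back to false.
        }
    }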




Re: Doubts Kafka

Posted by Christopher Piggott <cp...@gmail.com>.
> The consumer used Zookeeper to store offsets, in 0.8.2 there's an option
> to use Kafka itself for that (by setting *offsets.storage = kafka*).

Does it still really live in ZooKeeper, with Kafka just proxying the
requests through?


Re: Doubts Kafka

Posted by Gwen Shapira <gs...@cloudera.com>.
Hi Eduardo,

1. "Why sometimes the applications prefer to connect to zookeeper instead
brokers?"

I assume you are talking about the clients and some of our tools?
These are parts of an older design and we are actively working on fixing
this. The consumer used Zookeeper to store offsets; in 0.8.2 there's an
option to use Kafka itself for that (by setting *offsets.storage = kafka*).
We are planning on fixing the tools in 0.9, but obviously they are less
performance-sensitive than the consumers.

2. Regarding your tests and disk usage - I'm not sure exactly what fills
your disk. If it's the Kafka message logs (i.e. log.dir), then we expect
to store the size of all messages sent times the replication factor
configured for each topic. We keep messages for the amount of time
specified in the *log.retention* parameters. If the disk fills up within
minutes, either set log.retention.minutes very low (at the risk of losing
data if consumers need to restart), or make sure your disk capacity matches
the rate at which producers send data.
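
A rough back-of-the-envelope sketch of that sizing, with invented numbers
for the producer rate, retention and replication factor:

    // Rough rule of thumb: disk needed ~= producer rate * retention window *
    // replication factor. The numbers below are invented for illustration.
    public class RetentionSizingSketch {
        public static void main(String[] args) {
            double producerMbPerSec = 50.0;  // assumed aggregate producer throughput
            int replicationFactor = 2;       // assumed replication factor of the topic
            int retentionHours = 24;         // e.g. log.retention.hours=24 on the brokers

            double requiredGb =
                producerMbPerSec * 3600 * retentionHours * replicationFactor / 1024.0;
            System.out.printf("Disk needed across the cluster: ~%.0f GB%n", requiredGb);
            // Prints roughly 8438 GB for these numbers, so a single 250 GB disk
            // fills up very quickly unless retention (e.g. log.retention.minutes)
            // or the producer rate is reduced.
        }
    }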

Gwen

